Announcement

Wednesday, June 10, 2009

Writing a Code Duplication Detector

Now that I have started consulting on software development, I am developing a different way of analyzing code to quickly detect the code hotspots which need to be addressed first. The techniques I am using are different from traditional 'static code analysis' (e.g. using tools like lint, PMD, FindBugs etc). I am using a mix of various code metrics and visualizations to detect 'anomalies'. In this process, I found a need for a good 'code duplication detector'.

There are two good code duplication detectors already available.
  1. CPD (Copy Paste Detector) from PMD project.
  2. Simian from RedHill Consulting.
I am a big fan of reuse and try to avoid rewrites as much as possible. Still, in this case neither tool was appropriate for my needs.

Out of the box, CPD supports only Java, JSP, C, C++, Fortran and PHP code. I wanted C# and other languages also. That means I would have to write a lexer for every new language that I need.

Simian supports almost all common languages, but it is 'closed' source. Hence I cannot customize or adapt it for my needs.

So the option was to write my own. I decided to write it in Python. Thanks to Python and the tools available with it, it was really quick to write. In 3 days, I wrote a nice Code Duplication Detector which supports lots of languages, and I optimized it as well.

The key parts for writing a duplication detector are
  1. Lexers (to generate tokens by parsing the source code)
  2. A good string matching algorithm to detect the duplicates.
I wanted to avoid writing my own lexers, since it meant writing a new lexer every time I want to support a new language. I decided to piggyback on the excellent lexers from the Pygments project. Pygments is a 'syntax highlighter' written in Python. It already supports a large number of programming languages and markups, and it is actively developed.
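Pygments lexers are easy to drive programmatically. As a rough sketch (the snippet, filename, and the exact token filtering are my illustrative assumptions, not necessarily what CDD does), tokenizing a C# fragment looks like this:

```python
from pygments.lexers import get_lexer_for_filename
from pygments.token import Token

# Illustrative C# snippet; get_lexer_for_filename picks the lexer by extension.
code = "int Add(int a, int b) { return a + b; }"
lexer = get_lexer_for_filename("example.cs")

# Keep only tokens that matter for duplicate matching:
# drop whitespace and comments so formatting differences don't break matches.
tokens = [
    (ttype, value)
    for ttype, value in lexer.get_tokens(code)
    if ttype not in Token.Comment and not value.isspace()
]
print(tokens)
```

The same code then works for any of the many languages Pygments supports, which is exactly what made writing per-language lexers unnecessary.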

For algorithms, I started by studying CPD. The CPD pages just mention that it uses the Rabin-Karp string matching algorithm; however, I could not find any design document. There are many good references and code examples of the Rabin-Karp algorithm on the internet (e.g. the Rabin-Karp Wikipedia entry). After some experimentation I was able to implement it in Python.
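The gist of Rabin-Karp as applied here: slide a fixed-size window over the token stream, keep a rolling hash that is cheap to update, and verify windows whose hashes repeat. A minimal sketch (the window size, base, and modulus are illustrative choices, not CDD's actual parameters):

```python
def find_duplicates(tokens, window=5, base=257, mod=(1 << 61) - 1):
    """Map each repeated window (as a tuple) to its start positions."""
    if len(tokens) < window:
        return {}
    vals = [hash(t) % mod for t in tokens]
    high = pow(base, window - 1, mod)   # weight of the outgoing token
    # hash of the first window
    h = 0
    for v in vals[:window]:
        h = (h * base + v) % mod
    buckets = {h: [0]}                  # hash -> candidate start positions
    for i in range(window, len(vals)):
        # roll: drop vals[i - window], shift, append vals[i]
        h = ((h - vals[i - window] * high) * base + vals[i]) % mod
        buckets.setdefault(h, []).append(i - window + 1)
    # verify candidates by content, ruling out hash collisions
    dups = {}
    for starts in buckets.values():
        if len(starts) > 1:
            for s in starts:
                dups.setdefault(tuple(tokens[s:s + window]), []).append(s)
    return {k: v for k, v in dups.items() if len(v) > 1}

# e.g. the 5-token run "abcde" repeats at positions 0 and 7
print(find_duplicates(list("abcdeXYabcde")))
```

Each rolling update is O(1), so scanning n tokens costs O(n) plus the verification of the (hopefully few) colliding candidates — which is what makes processing hundreds of files in minutes feasible.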

After combining the two parts, the Code Duplication Detector (CDD) was ready. I tested it on the Apache httpd server source. After two iterations of bug fixes and optimizations, it can process the 865 files of the Apache httpd source code in about 3 minutes 30 seconds. Not bad for 3 days of work.

Check the duplicates found in the source code of the Apache httpd server and the Apache Tomcat server:
  • Duplicates found in Apache httpd server source using CDD
    Number of files analyzed : 865
    Number of duplicates found : 161
    Time required for detection : 3 min 30 seconds

  • Duplicates found in Apache Tomcat source code using CDD
    Number of files analyzed : 1774
    Number of duplicates found : 436
    Time required for detection : 19 min 35 seconds
Jai Ho Python :-)

Now I have started working on (a) visualization of the amount of duplication in various modules and (b) generating 'tag clouds' for source keywords and class names.

PS> I intend to make the CDD code available under a BSD license. It will take some more time. Meanwhile, if you are interested in trying it out, send me a mail and I will email you the current source.

Update (21 July 2009) - The CDD code is now available as part of the Thinking Craftsman Toolkit project on Google Code.

4 comments:

ensonic said...

Tried this today. Thanks - nice tool! I filed a few reports on the code.google project site and hope you can make the changes.

I wonder if it would be worth canonicalizing the variables and literals in the snippets to get more hits. I mean, if there is strcpy(str1, str2); and strcpy(str1, test); that is not a duplicate. But if you replace all vars by v1, v2, v3..., then it could catch more copies, where the code structure is the same but the vars have been changed.
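The canonicalization idea from this comment could be sketched roughly like this — the (kind, value) token format and the "name" kind are illustrative assumptions, not CDD's actual representation. Each distinct identifier is renamed to v1, v2, ... in order of first appearance, so snippets with the same structure but different variable names compare equal:

```python
def canonicalize(tokens):
    """Rename identifiers to v1, v2, ... in order of first appearance."""
    mapping = {}
    out = []
    for kind, value in tokens:
        if kind == "name":  # identifier token (assumed kind label)
            if value not in mapping:
                mapping[value] = "v%d" % (len(mapping) + 1)
            out.append((kind, mapping[value]))
        else:
            out.append((kind, value))
    return out

a = canonicalize([("name", "strcpy"), ("name", "str1"), ("name", "str2")])
b = canonicalize([("name", "strcpy"), ("name", "dst"), ("name", "src")])
print(a == b)  # True: both normalize to v1, v2, v3
```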

Nitin Bhide said...

Thanks. I will look at the bug reports.

About catching the same code structure with different variable names: I tried initially but could not find a good way at that time. Will give it one more try.

Yogesh said...

I think that an AST/ASG can also be used for detecting duplicates. It will be interesting to see how we can use it. It might be more useful to hack into compilers (such as gccxml) and produce different statistics.

Nitin Bhide said...

Yogesh,
AST/ASG can definitely be used for detecting duplicates. But, it is far more challenging to hack on the compiler.

A standalone tool which works on text files (source code) has its advantages and disadvantages. It is simpler to make it work on multiple languages. However it may not be able to detect some deep rooted duplicates that AST/ASG approach will be able to detect.

Also, the compiler used changes based on language, platform, user preference etc. So adding detection to GCC may not work if the user is using MSVC for his work. For dynamic languages like Ruby and Python, AST/ASG based duplication checking is another challenge.

In an average project, simple tools like CDD or CPD (Copy Paste Detector from the PMD project) detect a large number of duplicates. So the cost/benefit ratio works in the simpler tools' favor.