
Sunday, November 01, 2009

Installing (K)Ubuntu as a Guest OS in VirtualBox

So far I have mainly used Windows machines for development, at work as well as at home. Recently I started experimenting with Ubuntu Linux, so I wanted to install Ubuntu as a guest OS on VirtualBox. However, I faced a few problems (for example, the maximum screen size was stuck at 800x600 and it took some time to figure out how to enable larger screen sizes). Hence I wrote down the steps that worked for me.

This is the configuration that I used:
    Host OS : Windows Vista
    Guest OS : Kubuntu 8.1
    VirtualBox version : 3.0

Installing Kubuntu in VirtualBox
  1. Create a new VM (let's name it 'Kubuntu') in VirtualBox.
  2. Put the Kubuntu CD in the CDROM drive, or mount the Kubuntu ISO image as a drive, and attach it to the VM.
  3. Start the new VM (Kubuntu).
  4. The Kubuntu install CD will be detected. Proceed with the installation.
  5. The basic Kubuntu installation is now done.
  6. You should now be able to boot into Kubuntu.
Now you need to install the Guest Additions for this guest OS, so that you can share files between the host and guest OS, change the screen size, etc.

Installing Guest Additions
  1. Go to 'VirtualBox->Devices->Install Guest Additions'.
  2. If a CDROM drive is already mounted, this will not work. Unmount the CDROM drive first.
  3. Go to 'VirtualBox->Devices->Install Guest Additions' again. Select the 'VBoxGuestAdditions.iso' image and mount it as a device.
  4. You will get a 'device list' notification in the Kubuntu guest OS.
  5. Select the 'virtualbox' device for installing the Guest Additions.
  6. In Kubuntu, run 'konsole' to start a terminal window.
  7. Run 'cd /media/cdrom0' (or wherever the VBoxGuestAdditions.iso is mounted).
  8. Run 'sudo sh ./VBoxLinuxAdditions-x86.run'. This will install the Guest Additions kernel modules.
  9. Reboot the guest OS.
  10. The Guest Additions are now installed.
Reconfiguring the Display Size
  1. By default the maximum display size under the VM is set to 800x600.
  2. Shut down the guest OS and exit VirtualBox.
  3. Go to the VirtualBox installation directory in the host OS (run 'cd ').
  4. Run 'VBoxManage setextradata global GUI/MaxGuestResolution 1200,800' from the command line. Replace '1200,800' with whatever resolution your computer supports.
  5. Start VirtualBox and start the Kubuntu VM.
  6. Click the 'Machine->Auto Resize Guest Display' menu item.
  7. Now minimize/maximize the VM window. The Kubuntu desktop inside will get resized.
  8. In Kubuntu, go to the Application Launcher, 'System Settings->Display'. Change the display size to the size you want and apply.
  9. The next time you start the Kubuntu VM, you may just have to repeat steps 6 and 7.
Installing the Development Setup

  1. Run 'sudo apt-get install linux-kernel-headers'.
  2. Now your basic compiler setup is done.
  3. You may want to install other software, like the Code::Blocks IDE.
  4. Run the 'adept' package manager.
  5. Search for 'Code blocks', Python, etc. Click 'Install' and then select 'Apply Changes' from the Adept menu.
Now your setup is done. Create a 'snapshot' of the current VM, so that you can revert to a known setup if required.

Sharing Files between the Windows Host and the Linux Guest

  1. On Windows, create a folder (e.g. c:\shared).
  2. Go to 'VirtualBox->Devices->Shared Folders' and add the created folder. Give the folder a name (e.g. 'shared'),
    OR
    run the command 'VBoxManage sharedfolder add "Kubuntu" -name "shared" -hostpath "c:\shared"' (here 'Kubuntu' is the VM name).
  3. In the guest OS, run the following commands:
    cd $HOME
    mkdir shared
    sudo mount -t vboxsf shared $HOME/shared

Now you can access the 'shared' folder from both Windows and Linux.
  

If these steps help you, or if you find any mistakes or want to suggest any improvements, please leave a comment.

Sunday, August 09, 2009

Implementing a Well behaved Singleton - 4

In part 3 of this series, I talked about implementing a singleton such that the construction sequence is guaranteed and destruction is guaranteed; however, the sequence in which destruction happens is not guaranteed.

Sometimes one singleton object depends on another singleton object. In such cases, the technique described in part 3 is useful to ensure that the singletons are constructed in the correct order. However, since it uses 'function static objects', the destruction sequence is compiler dependent. Suppose singleton object Foo depends on singleton object Bar; this technique will ensure the correct order of construction. However, the compiler may call the destructors of Foo and Bar in the wrong order, since the compiler is not aware of this dependency.

To fix the destruction sequence, we will take the help of a little known function called 'atexit'. You can find the details of 'atexit' here or in your compiler's help. The important property is: if more than one atexit function has been specified by different calls to this function, they are all executed in reverse order, as a stack, i.e. the last function specified is the first to be executed at exit. This is the property we need to ensure that the singletons are destructed in the reverse order of their construction.
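
The same stack-like behavior is easy to see outside C++ too. For example, this little Python script (just an illustration of the 'reverse order' property, not part of the C++ implementation below) prints its handlers in the reverse order of their registration:

import atexit

def make_handler(name):
    def handler():
        print('%s destroyed' % name)
    return handler

atexit.register(make_handler('Bar'))  # registered first
atexit.register(make_handler('Foo'))  # registered second
# On normal interpreter exit this prints:
#   Foo destroyed
#   Bar destroyed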


class FooSingleton
{
public:
    ~FooSingleton(void);
    static FooSingleton& GetSingleton(void);

private:
    FooSingleton(void);
    static void DestroySingleton(void);

private:
    static FooSingleton* m_pFoo;
    int m_var;
};

////////////////////////////////
// In the .cpp file
#include <cstdlib> // for atexit()

FooSingleton* FooSingleton::m_pFoo = NULL;

FooSingleton::FooSingleton(void)
{
    //...
    // Code to construct the foo singleton.
    ::atexit(FooSingleton::DestroySingleton); // THIS IS THE KEY
}

FooSingleton&
FooSingleton::GetSingleton(void)
{
    // Construct the singleton on first use.
    if( m_pFoo == NULL)
    {
        m_pFoo = new FooSingleton;
    }
    return(*m_pFoo);
}

void
FooSingleton::DestroySingleton(void)
{
    delete m_pFoo;
    m_pFoo = NULL;
}

Now, if FooSingleton and BarSingleton are coded in a similar way and FooSingleton depends on BarSingleton, then 'FooSingleton::GetSingleton' will call BarSingleton::GetSingleton(). Hence BarSingleton will get constructed first, and in the process BarSingleton::DestroySingleton will get registered with 'atexit'. Next, FooSingleton::DestroySingleton will get registered with 'atexit'.

When the program exits, the functions registered with 'atexit' will be called in the reverse order of registration. Hence FooSingleton::DestroySingleton will get called first, and then BarSingleton::DestroySingleton.

Now we have ensured that the singleton objects are destructed in the reverse order of their construction. This technique is completely standards compliant and hence should work with all C++ compilers.

Many developers have the misconception that Singleton is the easiest design pattern to implement. As you can see, implementing a 'well behaved' singleton is NOT such an easy problem to solve.

Saturday, August 01, 2009

Visualizing Code Duplication in a project

Treemap visualization is an excellent way to visualize information/various metrics about a directory tree (e.g. a source file directory tree). I have used treemaps for visualizing SourceMonitor metrics for an entire project, with excellent results. Unfortunately there are very few simple, open source treemap visualization tools available. There is the JTreemap applet, which can be used to view CSV files as treemaps. Some time back an Excel plugin was available from the Microsoft Research site. However, there is no trace of it now on the Microsoft Research site.

As part of the Thinking Craftsman Toolkit, I wrote a simple Tkinter/Python treemap viewer to view SourceMonitor metrics as treemaps. After writing the initial version of the Code Duplication Detector, I realized there is no good tool to visually check the 'proliferation' of duplication across various files/directories. Tools like Simian or CPD just give a list of duplications. I thought treemaps could be an excellent way to visualize the duplication. Hence I added a '-t' flag to CDD. This flag displays the treemap view of the code duplication. You can see a screen snapshot of the treemap view here (see the thumbnail below).


Tuesday, July 21, 2009

Thinking Craftsman Toolkit on Google code

I have created a project named 'Thinking Craftsman Toolkit (TC Toolkit)' on Google Code. Currently it includes three small tools:
  1. Code Duplication Detector (CDD)
    Code Duplication Detector is similar to Copy Paste Detector (CPD) or Simian. It uses Pygments lexers to parse the source files and the Rabin-Karp algorithm to detect duplicates. Hence it supports all languages supported by Pygments.

  2. Token Tag Cloud (TTC)
    Some time back I read a blog article, 'See How Noisy Your Code Is'. TTC is a tool for creating various tag clouds based on token types (e.g. keywords, names, class names etc.).

  3. Sourcemonitor Treemap Viewer (SMTreemap)
    Source Monitor is an excellent tool for generating various metrics from source code (e.g. maximum complexity, average complexity, line count, block depth etc.). However, it is difficult to quickly analyse this data for large code bases. Treemaps are excellent for visualizing hierarchical data in two dimensions (as size and color). This tool uses Tkinter to display the SourceMonitor data as a treemap. You have to export the Source Monitor data as CSV or XML; smtreemap.py can then use this CSV or XML file as input to display the treemap.
There is no installer or setup file yet. You can get the tools by checking out the source from the SVN repository.

As promised in the last blog post, 'Writing a Code Duplication Detector', the source for the Code Duplication Detector is now released as part of the TC Toolkit project.

Wednesday, June 10, 2009

Writing a Code Duplication Detector

Now that I have started consulting on software development, I am developing a different way of analyzing code for quickly detecting code hotspots which need to be addressed first. The techniques I am using are different from traditional 'static code analysis' (e.g. using tools like lint, PMD, FindBugs etc.). I am using a mix of various code metrics and visualizations to detect 'anomalies'. In this process, I found the need for a good 'code duplication detector'.

There are two good code duplication detectors already available.
  1. CPD (Copy Paste Detector) from PMD project.
  2. Simian from RedHill Consulting.
I am a big fan of reuse and try to avoid rewrites as much as possible. Still, in this case, both tools were not appropriate for my needs.

Out of the box, CPD supports only Java, JSP, C, C++, Fortran and PHP code. I wanted C# and other languages as well. It means I would have to write a lexer for any new language that I need.

Simian supports almost all common languages, but it is closed source. Hence I cannot customize or adapt it for my needs.

So the option was to write my own. I decided to write it in Python. Thanks to Python and the tools available with it, it was really quick to write. In 3 days, I wrote a nice Code Duplication Detector which supports lots of languages, and optimized it too.

The key parts of writing a duplication detector are
  1. lexers (to generate tokens by parsing the source code)
  2. a good string matching algorithm to detect the duplicates.
I wanted to avoid writing my own lexers, since it meant writing a new lexer every time I wanted to support a new language. I decided to piggyback on the excellent lexers from the Pygments project. Pygments is a syntax highlighter written in Python. It already supports a large number of programming languages and markups, and it is actively developed.
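
For example, getting a language-agnostic token stream out of Pygments takes only a few lines (a minimal sketch; CDD's actual tokenization logic may differ):

import pygments.lexers

# Pick a lexer based on the file name; works for any language Pygments knows.
lexer = pygments.lexers.get_lexer_for_filename('example.cpp')
with open('example.cpp') as f:
    code = f.read()
# get_tokens() yields (tokentype, text) pairs.
for tokentype, value in lexer.get_tokens(code):
    print('%s %r' % (tokentype, value))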

For the algorithm, I started by studying CPD. The CPD pages just mention that it uses the Rabin-Karp string matching algorithm; however, I could not find any design document. There are many good references and code examples of the Rabin-Karp algorithm on the internet (e.g. the Rabin-Karp Wikipedia entry). After some experimentation I was able to implement it in Python.
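
To give a flavor of the algorithm, here is a minimal Rabin-Karp style sketch over a token list (an illustration only, not CDD's actual implementation): the hash of a k-token window is updated in O(1) as the window slides, and candidate matches are verified to rule out hash collisions.

def find_duplicate_windows(tokens, k, base=256, mod=(1 << 61) - 1):
    """Return pairs of start indices whose k-token windows are identical."""
    if len(tokens) < k:
        return []
    vals = [hash(t) % mod for t in tokens]
    high = pow(base, k - 1, mod)   # weight of the token leaving the window
    h = 0
    for v in vals[:k]:             # hash of the first window
        h = (h * base + v) % mod
    seen = {h: [0]}
    matches = []
    for i in range(1, len(tokens) - k + 1):
        # Roll the hash: drop tokens[i-1], append tokens[i+k-1].
        h = ((h - vals[i - 1] * high) * base + vals[i + k - 1]) % mod
        for j in seen.get(h, []):
            if tokens[j:j + k] == tokens[i:i + k]:  # verify; hashes can collide
                matches.append((j, i))
        seen.setdefault(h, []).append(i)
    return matches

# e.g. find_duplicate_windows(list_of_token_strings, k=20)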

After combining the two parts, the Code Duplication Detector (CDD) was ready. I tested it on the Apache httpd server source. After two iterations of bug fixes and optimizations, it can process the 865 files of the Apache httpd source code in about 3 minutes 30 seconds. Not bad for 3 days' work.

Check the duplicates found in the source code of the Apache httpd server and the Apache Tomcat server:
  • Duplicates found in the Apache httpd server source using CDD
    Number of files analyzed : 865
    Number of duplicates found : 161
    Time required for detection : 3 min 30 seconds

  • Duplicates found in the Apache Tomcat source code using CDD
    Number of files analyzed : 1774
    Number of duplicates found : 436
    Time required for detection : 19 min 35 seconds
Jai Ho Python :-)

Now I have started working on (a) visualization of the amount of duplication in various modules, and (b) generating 'tag clouds' for source keywords and class names.

PS> I intend to make the CDD code available under the BSD license. It will take some more time. Meanwhile, if you are interested in trying it out, send me a mail and I will email you the current source.

Update (21 July 2009) - The CDD code is now available as part of the Thinking Craftsman Toolkit project on Google Code.

Saturday, May 23, 2009

Treemap visualization of Results of 2009 General Elections of India

Recently I have become interested in various visualization techniques. One technique that I find very interesting is 'treemaps'. While studying treemaps, I found an excellent JavaScript visualization library, "The JavaScript Information Visualization Toolkit".

Using the treemap component from TheJIT and data from the Election Commission's site, I made a treemap visualization of the results of the 2009 general elections of India. This representation gives an overview of the election results across states, coalitions and parties, compared to the results of the 2004 elections.

The visualization is published on my website.

Monday, May 04, 2009

Thinking Craftsman website

As you know, I am working as a consultant and teacher/mentor in the craft of software development. Now I have my own website: Thinking Craftsman.

Why the name Thinking Craftsman ???


First, software development is still a 'craft'. We may call it 'software engineering', but it is really a craft. If you compare it with other engineering disciplines, software is still much more person dependent; the quality of the code depends a lot on the developer. Thus software development/coding is a 'craft', albeit a modern craft. Hence we, the software developers, are really 'software craftsmen'. I first came across the concept of software as a craft in the Pragmatic Programmer book, and it struck a chord.
  1. This is suggested by the subtitle of the "Pragmatic Programmer" book, which is "From Journeyman to Master"
  2. Software Craftsmanship Wikipedia article
  3. Craftsmanship : article on Joel on Software
As in any craft, a software craftsman progresses along the ladder from Novice -> Apprentice -> Craftsman -> Master Craftsman. Unfortunately there is no course or book which teaches you how to progress from the Craftsman to the Master Craftsman level. Obviously, master craftsmen of any craft are extremely rare, and the same is true for the craft of software development. If you get a chance to work with a master craftsman, you are very fortunate, because you will learn tremendously within a short period.

Personally, I think there IS a level between Craftsman and Master Craftsman. I call this level 'Thinking Craftsman'. A Thinking Craftsman is someone who is always thinking about what he is doing while he is doing it, and thinking about ways to improve it. Thus EVERY DAY he/she is taking a small step towards the ultimate goal of becoming a 'Master Craftsman'. A Master Craftsman may directly give the solution to a problem. The Thinking Craftsman may try multiple options before finally reaching the same solution, but he/she will not give up till the solution is reached.

All these years, I have consistently tried to be a 'Thinking Craftsman'. Now I am looking forward to guiding other craftsmen on their journey to becoming Thinking Craftsmen and beyond, through my consultancy work and learning programs.

Monday, April 13, 2009

Book recommendations for Software Developers

As I mentioned in my previous post on 'C++ Book Recommendations', I have now published a list of books for software developers. These are the books that helped me develop my ideas about software development (irrespective of technology or programming language).

Check the list at "Book Recommendations for Software Developers"

Monday, April 06, 2009

Unusual way of Performance Optimization

I have done performance optimization in all my projects. Usually it involves selecting the appropriate data structure or algorithm, caching calculated data to avoid re-computation, etc. However, sometimes performance optimization poses unique challenges.

In this particular case, the problem involved fitting a spline surface to a cloud of points. To fit the final surface, a reference spline surface was created and the point cloud was projected onto the reference spline surface. The (u,v) coordinates of each projected point were then assigned as the reference (u,v) for the original point. We were using the Parasolid geometry kernel. Parasolid has a point projection function. There is no analytical solution for projecting a point onto a spline surface; hence the algorithms for doing so are iterative, converging to the solution within a given tolerance. However, in our case, we were projecting anywhere from 3,000 to 10,000 points onto a spline surface. In the typical use case, the projection of the point cloud happened many times, on different reference surfaces.

The initial naive version took 30 to 35 minutes to complete the simplest case. Complex cases required hours. Obviously this was a 'deal breaker'. It was clear that this was not going to be acceptable to the customer. We made some obvious changes, like caching the results wherever possible. This somewhat improved the computation time (e.g. from 30 minutes to 10-15 minutes).

Obviously that was not enough. We wanted the simplest cases to finish in seconds. A different solution was needed, and we were stuck. Implementing the projection algorithm ourselves was out of the question (it is a complex algorithm, and the Parasolid implementation was probably already highly optimized). Parallelizing with multiple threads was another option, but that would give only around a 3-4x speedup. We needed a much, much larger speedup than that.

I re-examined the Parasolid projection function and found that there is an additional parameter to the function: it is possible to give a 'hint' for the projected (u,v) values. If the hint is good, it can cut down the number of iterations required to converge to the solution. But the way to compute the 'hint (u,v) values' has to be simple and quick. All of a sudden, I realized that the input point cloud is laser scanned data, so the chances that the previous point is near the current point are very high. Now the solution was simple: use the calculated (u,v) values of the previous point as the 'hint' for computing the projected (u,v) of the current point. Since in most cases the previous point and the current point are very near each other, the number of iterations required to compute the projection dropped drastically. The code change required to implement this solution was minimal, but the speedup achieved was really dramatic.
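
In pseudo-Python, the change amounts to threading one extra value through the loop ('project_point_on_surface' below is a hypothetical stand-in for the actual Parasolid projection call, whose real name and signature are different):

# 'project_point_on_surface' is a hypothetical stand-in for the Parasolid
# projection function; assume it returns the projected (u, v) and accepts
# an optional starting hint.
def project_point_cloud(surface, points):
    projected_uvs = []
    hint_uv = None
    for pt in points:
        # Consecutive laser-scanned points are close together, so the
        # previous result is a good starting guess for the iteration.
        uv = project_point_on_surface(surface, pt, hint=hint_uv)
        projected_uvs.append(uv)
        hint_uv = uv
    return projected_uvs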

We made a few other performance improvements. Finally, when we delivered the project, the same simple test case which took 30 minutes was finishing in 7-10 seconds. A speedup of about 180 times. :-)

This was really the most unusual and dramatic performance optimization that I have done. It was also certainly a case of 'complex looking problems usually have very simple solutions'.

Wednesday, April 01, 2009

C/C++ Book Recommendations

I have prepared a list of C/C++ books which helped me in improving my programming skills. I have put the list on my website.

In a few days I am planning to make two more lists. These will list the books and articles which influenced my thinking on software design and programming.
If you find these lists useful, or want to suggest a book or link, please leave a comment.

Tuesday, March 24, 2009

Comparison of VSS, CVS and SVN

I have prepared a comparison of the three commonly used version control systems: Visual SourceSafe (VSS), CVS and Subversion.

The weights and scores are based on my judgement. I think this type of weighted-score comparison may help you in convincing people (e.g. your project team, colleagues, senior management in your company) to use Subversion.

Total Score
Subversion : 251
CVS : 171
Visual SourceSafe(VSS) : 138

Check it out at Comparison of VSS, CVS and SVN

Tuesday, March 10, 2009

Leadership

I found this good list of characteristics of a leader in a 'C++ Coding Standard'. Well, a C++ coding standard is not a likely place for finding advice on leadership, but sometimes you get thought gems in unlikely places. So here it is.

Leaders:
  1. lead by example
  2. don't ask anything of anyone that they wouldn't do themselves
  3. are called on to make difficult and unpopular decisions
  4. keep the team focused
  5. reward/support their team in whatever they do
  6. keep/clear unnecessary crap out of the way of the team
  7. TAKE the blame when things go wrong and SHARE the credit when things go right.

The advice resonated with how I think.

Monday, March 02, 2009

Portable Precompiled Headers for c/c++

If you are writing a cross platform application which is fairly big, then one problem you may encounter is how to reduce the compilation time (precompiled headers being the usual answer).

One way to reduce compilation time is to reduce the number of times the same header file is #included and re-parsed. Usually we add an #ifndef/#endif pair at the start and end of a header file to avoid the problems of multiple inclusion. However, the compiler still has to parse the entire file to find the #ifndef and its matching #endif, and then throw the entire file away on the 2nd and subsequent inclusions.

On Windows, the Visual Studio compiler has an option for using precompiled headers, which are supposed to solve this problem. However, my experience is that sometimes turning ON precompiled headers actually increases the compilation time. This is typically the case when files which are regularly modified are added to the precompiled headers. Also, the feature is specific to one compiler and not useful on other compilers/platforms.

The technique below reduces compilation time by ensuring that each header file (.h) is included ONLY once. It also has some nice additional side benefits, like correctly defining the include sequence, simpler cross platform includes (e.g. using stdlib.h on Windows but unistd.h on Unix), an easier way to change include paths, etc.

The technique relies on #ifdef and #endif pairs. Since #ifdef/#endif are part of every C++ compiler implementation, it works with all C++ compilers.

Here are the steps:
  1. In each module in your project, add a module specific include file, module_inc.h.
  2. Suppose you want to add a new header (a.h) in the module, and a.h depends on b.h and c.h. Make the following changes in module_inc.h:

    // define dependencies of a.h
    #ifdef A_INC
    #define B_INC // if a.h depends on b.h
    #define C_INC // if a.h depends on c.h
    #endif //A_INC
    // define dependencies of b.h
    #ifdef B_INC
    #define C_INC // if b.h depends on c.h
    #define D_INC // if b.h depends on d.h
    #endif //B_INC

    #ifdef C_INC
    #include "c.h"
    #endif

    #ifdef B_INC
    #include "b.h"
    #endif

    #ifdef A_INC
    #include "a.h"
    #endif

  3. Now, to record a.h's dependency on b.h and c.h, add the following lines in a.h:

    #ifndef A_H
    #define A_H

    #define B_INC
    #define C_INC

    #endif
  4. To include a.h in a.cpp, add the following lines in a.cpp:

    #define A_INC
    #include "module_inc.h"
That's it.

Advantages of defining include files and dependencies this way

You will find a few unique advantages in defining include files and dependencies this way:
  • First, a.h is directly included only in "module_inc.h". Everywhere else, module_inc.h is included and only A_INC is defined.
  • Even though a.h depends on b.h and c.h, neither b.h nor c.h is directly included in a.h or in a.cpp. Still, b.h and c.h get preprocessed in the correct sequence, and are processed only once.
  • It is very difficult to create circular dependencies between include files in this framework. Compilation will break in case of circular dependencies, which is a good thing.
  • Since a.h is never directly included in a.cpp or any other .cpp file in the module, (a) you can change the directory or path of a.h by changing the path in module_inc.h, and (b) you can split the file into two header files a1.h and a2.h and change the includes in module_inc.h. In both cases, the changes are 'localized' to module_inc.h. After the modifications are done in module_inc.h, you just need to recompile your project. Imagine how hard this would be without this kind of structure.
  • It does not matter in which sequence 'A_INC', 'B_INC' or 'C_INC' are defined; the files are always included in the correct sequence. So correcting the sequence of #include statements in different .cpp files whenever class dependencies change is not required.

How exactly does this work?

Let's look at what happens when the compiler processes a.cpp:
  1. First, A_INC is defined.
  2. The compiler (i.e. the preprocessor) starts processing module_inc.h.
  3. module_inc.h has a preprocessor directive "#ifdef A_INC". This section defines the dependencies of a.h by defining two other symbols, B_INC and C_INC.
  4. The compiler/preprocessor continues preprocessing module_inc.h and encounters #ifdef B_INC. This section defines the dependencies of b.h.
  5. The preprocessing continues and the compiler encounters #ifdef C_INC; this section has the instruction to actually include c.h (i.e. #include "c.h"). Notice that by this time C_INC has been defined TWICE, since a.h depends on c.h and b.h also depends on c.h. However, the actual file is included/processed only ONCE (i.e. #include "c.h" happens only once).
  6. The processing continues, and b.h and a.h are now included/preprocessed. Notice that the sequence of #include directives is the reverse of the sequence of the dependency definitions. This guarantees that the files are included in the correct sequence, irrespective of the sequence of #define directives in the .cpp files.
I have used this technique in two large scale, cross platform C++ development projects (150,000+ lines, 25+ man-years of development). This technique sped up full rebuilds, avoided dependency mistakes, and caught potential circular dependencies early in the life cycle of the project.

I always try to use techniques which help in either avoiding mistakes or detecting them early. This is one technique that definitely helps.

Wednesday, January 28, 2009

Using Social Network Analysis with Version control data

As I mentioned in the last post, I am experimenting with using social network analysis (SNA) on version control data. Now, with the SVNPlot project, I have a way of converting Subversion logs into a sqlite database. It allows me to query the data in many different ways.

I used the Rietveld repository data and did some preliminary analysis. I am not an expert on SNA, but the initial results look very interesting and promising. You can see the results on my website.



Update: Oscar Castaneda has added SNA data extraction to SVNPlot as part of his GSoC 2010 project. He has used these modifications to analyze Apache repositories and reported his findings at ApacheCon. Check the details at:
  1. Life After Google Summer of Code by Oscar Castaneda
  2. Oscar's GSoC 2010 proposal 
  3. Details on how to use his contributions in SVNPlot to extract the data.

Sunday, January 18, 2009

Social Network Analysis and Version Control

Recently I came across the concept of Social Network Analysis.

Given below is a small introduction to Social Network Analysis, from the Orgnet site:
Social network analysis [SNA] is the mapping and measuring of relationships and flows between people, groups, organizations, computers, web sites, and other information/knowledge processing entities. The nodes in the network are the people and groups while the links show relationships or flows between the nodes. SNA provides both a visual and a mathematical analysis of human relationships.
The concept originated in the social sciences (sociology, anthropology) to study relationships in communities. Today it is being used in fraud ring detection, identifying leaders in organizational networks, analyzing the resilience of computer networks, and various other ways. The various case studies on the Orgnet site can give you a good idea about the possibilities.

I started thinking about applying SNA to version control history, with files and authors as nodes. There is some research going on in this area in universities; the references below have a few links. A Google search for "data mining version control" will give you additional links.

With SVNPlot, I now have a way of converting Subversion logs into an SQLite database. Python also has some excellent libraries for network analysis; I am using NetworkX for the analysis and Matplotlib for visualization. I think such analysis will be useful for:
  1. identifying the key developers and their specific areas in the project,
  2. finding key files (files which are involved in code changes more frequently than others),
  3. identifying clusters of related files (across directories and modules).
I think the results will be useful to software development companies as well, especially for getting advance warning of problems, and in big projects for identifying critical developers, planning technology transfer when moving people from one project to another, etc. I see many exciting possibilities.
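
To give an idea of what this looks like in code, here is a minimal sketch that builds an author-file graph from commit records and ranks files by how many distinct authors touched them (the 'changes' table and its columns are made up for illustration; the real SVNPlot schema differs):

import sqlite3
import networkx as nx

conn = sqlite3.connect('svnlog.db')
# Hypothetical schema: one row per changed file, with its author.
rows = conn.execute('SELECT author, filepath FROM changes')

# Bipartite graph: author nodes on one side, file nodes on the other.
G = nx.Graph()
for author, filepath in rows:
    G.add_node(author, kind='author')
    G.add_node(filepath, kind='file')
    G.add_edge(author, filepath)

# Files touched by many distinct authors are candidate 'key files'.
files = [n for n, d in G.nodes(data=True) if d['kind'] == 'file']
for f in sorted(files, key=G.degree, reverse=True)[:10]:
    print(G.degree(f), f)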

The initial results are interesting. I will put up the charts/analysis etc. on my site in a few days' time.

References and Interesting Articles/Links
  1. Introduction to Social Network Analysis (from orgnet.com)
  2. Casestudies of Social Network Analysis (from Orgnet.com)
  3. Wikipedia page on Social Networks (Check the history of Social Network Analysis)
  4. Social Life of Routers (Computer networks as social networks)
  5. Finding Go-to People and Subject Matter Experts in Organization
  6. Predicting Defects using Network Analysis on Dependency Graphs – ICSE 2008
  7. Mining Software Archives (a special issue of IEEE magazine)

Wednesday, January 14, 2009

SVNPlot - my first opensource project

During the 1 week gap between two jobs, I finally started an open source project. The project is called SVNPlot. It is inspired by StatSVN, the excellent Subversion statistics generation package.

SVNPlot generates graphs similar to StatSVN. The difference is in how the graphs are generated. SVNPlot generates them in two steps. First it converts the Subversion logs into a 'sqlite3' database. Then it uses SQL queries to extract the data from the database, and uses the excellent Matplotlib plotting library to plot the graphs.

I believe using SQL queries to extract the necessary data results in great flexibility in data extraction. Also, since sqlite3 is quite fast, it is possible to generate these graphs on demand.
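
For instance, a 'commit activity per month' graph boils down to one SQL query and a few plot calls (a sketch with a made-up table name and column; the actual SVNPlot schema and queries may differ):

import sqlite3
import matplotlib.pyplot as plt

conn = sqlite3.connect('svnlog.db')
# Hypothetical schema: an SVNLog table with a textual 'commitdate' column.
rows = conn.execute(
    "SELECT strftime('%Y-%m', commitdate) AS month, COUNT(*) "
    "FROM SVNLog GROUP BY month ORDER BY month").fetchall()
months = [r[0] for r in rows]
counts = [r[1] for r in rows]

plt.plot(range(len(months)), counts)
plt.xticks(range(len(months)), months, rotation=90)
plt.ylabel('number of commits')
plt.tight_layout()
plt.savefig('activity.png')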

As a tribute to Python and the author of Python, Guido van Rossum, I have generated the graphs for the Rietveld project. Check it out here.

SVNPlot is hosted on Google Code (http://code.google.com/p/svnplot/) and licensed under the New BSD license. For information on installation and usage, check the introduction page here.

I am using Python to implement SVNPlot. I am a novice at Python, hence any suggestions for improvement are welcome.