Thursday, September 23, 2010

How does YouTube detect copies of copyrighted material?

Jeff Atwood's Coding Horror blog usually has very useful and well-written articles. In a recent article titled 'YouTube vs. Fair Use', he talks about his experience of uploading a 90-second clip from a movie as a reference for a blog article. The interesting part of the article is his observations about how YouTube is able to 'detect' that this 90-second video is from a movie (copyrighted material).

While reading this article, I discovered a bunch of interesting links and information about detecting copies of audio and video files.

A TED Talk by Margaret Gould Stewart on "How YouTube thinks about copyright" 
The interesting parts of this video describe how YouTube detects possible 'copies' of the copyrighted material.
"The scale and speed of this system is truly breathtaking -- we're not just talking about a few videos, we're talking about over 100 years of video every day between new uploads and the legacy scans we regularly do across all of the content on the site. And when we compare those 100 years of video, we're comparing it against millions of reference files in our database. It'd be like 36,000 people staring at 36,000 monitors each and every day without as much as a coffee break."
While Google tools usually work on a massive scale, this one is in a class of its own. As Jeff observed in his article, the scope and scale are AMAZING.
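
The "36,000 people" figure in the quote can be checked with a little arithmetic. This is only a back-of-the-envelope sketch: the 100 years per day comes from the talk, while the 365-day year and the function name `viewers_needed` are my own assumptions for illustration.

```python
YEARS_OF_VIDEO_PER_DAY = 100  # figure quoted in the talk
DAYS_PER_YEAR = 365           # assumption: leap years ignored


def viewers_needed(years_per_day, hours_watched_per_day=24.0):
    """People required to watch one day's worth of scanned video in real time."""
    video_hours = years_per_day * DAYS_PER_YEAR * 24
    return video_hours / hours_watched_per_day


# Watching around the clock, with no coffee break at all
print(viewers_needed(YEARS_OF_VIDEO_PER_DAY))  # 36500.0
```

So the number in the talk is roughly 100 years expressed in days, rounded down. With a more humane 8-hour viewing shift, the same function gives three times as many people.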

I also discovered an interesting mobile phone application named "Shazam" while reading the related links. Shazam is an application which you can use to analyse and match music. When you install it on your phone and hold the microphone up to some music for about 20 to 30 seconds, it will tell you which song it is.
  • This is an article which explores "How Shazam works?"
  • There is another article which describes an experimental implementation of Shazam in Java, "Creating Shazam in Java". The code is not available because of patent issues.
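
To get a feel for the general idea of audio fingerprinting, here is a toy sketch in Python. It is not Shazam's or YouTube's actual algorithm. It assumes clean synthetic tones and a clip that starts exactly on a window boundary, and all the names (`WINDOW`, `dominant_bin`, `fingerprint`, `match_score`, `tone_sequence`) are made up for this illustration. Each window of samples is reduced to its strongest frequency bin, and pairs of consecutive peaks form the fingerprint.

```python
import cmath
import math

WINDOW = 64  # samples per analysis window (hypothetical choice)


def dominant_bin(window):
    """Index of the strongest frequency bin, found with a naive DFT (DC skipped)."""
    n = len(window)
    best_bin, best_mag = 1, -1.0
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(window))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin


def fingerprint(samples):
    """Set of (peak, next_peak) pairs taken from consecutive full windows."""
    peaks = [dominant_bin(samples[i:i + WINDOW])
             for i in range(0, len(samples) - WINDOW + 1, WINDOW)]
    return set(zip(peaks, peaks[1:]))


def match_score(clip, reference):
    """Fraction of the clip's peak pairs that also occur in the reference."""
    fc, fr = fingerprint(clip), fingerprint(reference)
    return len(fc & fr) / max(1, len(fc))


def tone_sequence(bins):
    """Synthesize one window of a pure tone for each requested frequency bin."""
    return [math.sin(2 * math.pi * k * i / WINDOW)
            for k in bins for i in range(WINDOW)]


song = tone_sequence([3, 7, 5, 9, 4])
clip = song[WINDOW:4 * WINDOW]            # excerpt aligned to window boundaries
other = tone_sequence([10, 2, 12, 6, 8])  # an unrelated "song"

print(match_score(clip, song))   # 1.0
print(match_score(clip, other))  # 0.0
```

A real system has to cope with noise, arbitrary clip offsets, and millions of reference files, so it needs far more robust features and an index for lookup. The sketch only shows why a short excerpt can be enough to identify its source.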

Duplicate detection (in text, audio and video) is a very interesting problem. The implications of automatic duplicate detection are useful as well as frightening.

Monday, September 13, 2010

Rereading The Mythical Man-Month - Flowcharts and UML

This is the second entry about rereading The Mythical Man-Month. You can read the first one here.

When I started programming, I never created 'detailed flow charts', even though many people (usually those not actively coding) urged it. I did draw a few small flowcharts to get the overall picture in my mind, but never detailed ones. Later the same thing happened with UML diagrams, and I always felt a little guilty about it, until I read Fred Brooks's observations about flow charts.

These are Fred Brooks's observations about flow charts in the section named 'Flow Chart Curse' (in essay 15, The Other Face).
The flow chart is a most thoroughly oversold piece of program documentation. Many programs don't need flow charts at all; few programs need more than a one-page flow chart. 
The detailed blow-by-blow flow chart, however, is an obsolete nuisance, suitable only for initiating beginners into algorithmic thinking.
In fact, flow charting is more preached than practiced. I have never seen an experienced programmer who routinely made detailed flow charts before beginning to write programs. Where organization standards require flow charts, these are almost invariably done after the fact. Many shops proudly use machine programs to generate this "indispensable design tool" from the completed code. I think this universal experience is not an embarrassing and deplorable departure from good practice, to be acknowledged only with a nervous laugh. Instead it is the application of good judgment, and it teaches us something about the utility of flow charts.
Try replacing 'flow chart' in the above discussion with 'UML diagram', and the whole discussion suddenly reads like a very recent one. In my experience, UML diagrams are more preached than practiced, and most of the time they are generated 'automatically' from the code at the end of the development cycle.

Of course, with UML too, I create a few class diagrams and sequence diagrams to clarify the top-level picture. But I don't create very detailed UML diagrams of every single class and function. In my experience, when a team tries to create very granular UML diagrams:
  • it confuses everyone in the team,
  • it takes too much effort,
  • and hence it is not really worth it.
However, customers and internal SEPG groups often ask for UML diagrams (sometimes because of the mostly misguided notion that a UML-documented design means 'good design', sometimes because everybody is doing it, and sometimes because of all the UML hype).

Still, I could never clearly explain why UML is not a 'silver bullet' for documenting software design. That is, until I reread 'No Silver Bullet' and its section 'Essential difficulties -> Invisibility'.
Software is invisible and unvisualizable. Geometric abstractions are powerful tools. The floor plan of a building helps both architect and client evaluate spaces, traffic flows, views. Contradictions become obvious, omissions can be caught. Scale drawings of mechanical parts and stick-figure models of molecules, although abstractions, serve the same purpose. A geometric reality is captured in a geometric abstraction.

The reality of software is not inherently embedded in space. Hence it has no ready geometric representation in the way that land has maps, silicon chips have diagrams, computers have connectivity schematics. As soon as we attempt to diagram software structure, we find it to constitute not one, but several, general directed graphs, superimposed one upon another. The several graphs may represent the flow of control, the flow of data, patterns of dependency, time sequence, name-space relationships. These are usually not even planar, much less hierarchical. Indeed, one of the ways of establishing conceptual control over such structure is to enforce link cutting until one or more of the graphs becomes hierarchical.

In spite of progress in restricting and simplifying the structures of software, they remain inherently unvisualizable, thus depriving the mind of some of its most powerful conceptual tools. This lack not only impedes the process of design within one mind, it severely hinders communication among minds.
Since software is inherently 'unvisualizable' (as explained above), software developers will always feel that any graphical modeling technique (flow charts, UML diagrams, swimlanes, data flow diagrams, etc.) is insufficient to imagine and communicate the software design.

So in conclusion:
  • Software design is hard.
  • Documenting and communicating software design is even harder.
  • Flowcharts, UML diagrams or any other diagrams are not a silver bullet of 'design documentation'.
  • Most likely, there will never be a 'silver bullet' in design documentation.
UPDATE (Oct 30, 2010):
I fully agree that UML diagrams are a useful communication tool to document, discuss, and understand the overall design. However, UML utterly fails as a 'documentation' tool for documenting every class and function. Documenting every single class with UML results in information overload. So I agree with the observations of Stephan.Schmidt and others.

In one of the projects I worked on, we had about 10 different modules. Each module had 2 or 3 key classes. We created a module dependency diagram. We documented the class hierarchy (inheritance and aggregation relationships) of these key classes in a single class diagram. We also documented the names of the various design patterns used in these classes. Then we documented the key functions of these modules with 2 or 3 sequence diagrams. This documentation served us well. New team members could quickly get the hang of the system and start contributing.
At the end of the project, our customer insisted on UML diagrams for every single class and function. Hence we reverse engineered the whole system and created UML documentation of every single class. It was completely incomprehensible and hence completely useless.

    Wednesday, September 08, 2010

    Rereading The Mythical Man-Month

    Recently I started rereading The Mythical Man-Month by Fred Brooks. The book is a collection of essays about programming and software development in general. The first edition was published in 1975. I am reading the 20th Anniversary edition, published in 1995. So the one I am reading is also 15 years old.

    The first time I read this book was sometime in 2000. At that time, I was just a somewhat experienced developer. In the last 10 years, I have led teams, worked in project leader roles, and taught software development, design, and architecture. Rereading the book now is a very interesting and educational experience. Some ideas which I did not completely understand or believe at that time now make sense. The ideas and advice in the book are still very sound, even after 35 years. Still, this book is known as a book 'that is quoted often but followed rarely'. I agree with that.

    I am going to review some ideas from the book which struck a chord.

    I found the following gem in Chapter 13, The Whole and The Part. Here the author is talking about 'System Debugging'.
    Build Plenty of Scaffolding
    By scaffolding I mean all programs and data built for debugging purposes but never intended to be in the final product. It is not unreasonable for there to be half as much code in scaffolding as there is in product.

    One form of scaffolding is the dummy component, which consists only of interfaces and perhaps some faked data or some small test cases. For example, a system may include a sort program which isn't finished yet. Its neighbors can be tested by using a dummy program that merely reads and tests the format of input data, and spews out a set of well-formatted meaningless but ordered data.

    Another form is the miniature file. A very common form of system bug is misunderstanding of formats for tape and disk files. So it is worthwhile to build some little files that have only a few typical records, but all the descriptions, pointers, etc.
    I suddenly realized that this description is very similar to the descriptions of Test Driven Development in Agile books and articles (with some minor changes in terminology). The 'dummy component' corresponds to mock objects. The estimate of the proportion of 'test code' to 'production code' is also very similar ('It is not unreasonable for there to be half as much code in scaffolding as there is in product').
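
Brooks's sort example translates almost directly into today's test doubles. Below is a minimal sketch in Python, assuming records are (integer key, name) tuples. The names `dummy_sort` and `build_report` are hypothetical and only serve the illustration. The dummy checks the format of its input and returns well-formatted, meaningless but ordered data, so the neighbouring component can be tested before the real sort exists.

```python
def dummy_sort(records):
    """Stand-in for the unfinished sort: validate input format, return canned ordered data."""
    for record in records:
        well_formed = (isinstance(record, tuple) and len(record) == 2
                       and isinstance(record[0], int)
                       and isinstance(record[1], str))
        if not well_formed:
            raise ValueError(f"bad record format: {record!r}")
    # Meaningless but well-formatted and ordered output, one record per input record
    return [(i, f"dummy-{i}") for i in range(len(records))]


def build_report(records, sort=dummy_sort):
    """Neighbouring component under test; it expects records ordered by key."""
    ordered = sort(records)
    return [f"{key:03d} {name}" for key, name in ordered]


print(build_report([(42, "b"), (7, "a")]))  # ['000 dummy-0', '001 dummy-1']
```

Once the real sort is finished, it can be passed in through the `sort` parameter and the dummy stays behind as scaffolding for the tests.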

    I found a few more such gems and will write about them later.