Monday, April 12, 2010

Google not identifying pages with the same content correctly?

Although identifying web pages with the same content has been the target of active research for many years now, it seems that even Google has trouble handling this problem correctly.
I am trying to build GWT from source, and for some reason the simple procedure described here: http://code.google.com/webtoolkit/makinggwtbetter.html#compiling does not work for me. So I set about googling for "building gwt from source", but it turns out that the first page of results contains more than four hits to the same mail thread (or even to the same individual emails in this thread). In other words, half of the top 10 results point to the same content.

The following are the hits from the top 10 results:

Not sure this is what I would expect!

Yes, extracting content from public email lists can be very hard, especially if the indexer has no access to the source of the text and sees only the various HTML representations of the same message. Still, I thought there were techniques for dealing with document similarity, e.g. MinHash (see Duplicate Detection).
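
To make the MinHash idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration of the general technique and says nothing about how Google actually detects duplicates. The function names (`shingles`, `minhash_signature`, `estimated_similarity`, `jaccard`), the choice of 3-word shingles, the 128 hash functions, and the use of seeded BLAKE2b as the hash family are all my own assumptions. It also assumes the text has already been extracted from the HTML, which is the hard part mentioned above.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle, seed):
    """One member of a hash family: a 64-bit hash of the shingle, varied by seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """For each hash function keep only the minimum hash value over all shingles."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions; this estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two hypothetical renderings of nearly the same message.
    a = shingles("how to build gwt from source using ant on linux")
    b = shingles("how to build gwt from source using ant on windows")
    print("exact:    ", jaccard(a, b))
    print("estimated:", estimated_similarity(minhash_signature(a), minhash_signature(b)))
```

The point is that two pages never have to be compared word by word: each one is reduced to a short fixed-size signature, and pages whose signatures agree in most positions are very likely near-duplicates, even if the surrounding HTML differs.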
