Tuesday, February 19, 2008

Hadoop - Large Yahoo! cluster

Yahoo!'s bet on the Hadoop framework is proving to be a good choice. They have started using Hadoop in their production environment, and every web search query now uses data processed on a very large Hadoop cluster. You can find more details on the Yahoo! Developer Network: Yahoo! Launches World's Largest Hadoop Production Application.

Some highlights:
  • Number of links between pages in the index: roughly 1 trillion links
  • Size of output: over 300 TB, compressed
  • Number of cores used to run a single Map-Reduce job: over 10,000
  • Raw disk used in the production cluster: over 5 Petabytes
  • Hadoop has allowed us to run the identical processing we ran pre-Hadoop on the same cluster in 66% of the time our previous system took. It does that while simplifying administration.
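The highlight about running a single Map-Reduce job on over 10,000 cores refers to building the webmap, which counts links between pages. As a rough sketch of what that programming model looks like, here is a tiny in-memory imitation of the map, shuffle, and reduce phases that Hadoop distributes across the cluster. This is a hypothetical illustration with made-up data, not Yahoo!'s actual code:

```python
# Hypothetical sketch of the MapReduce pattern behind a webmap-style
# link count. Hadoop runs these phases across thousands of machines;
# here they run in one process purely for illustration.
from collections import defaultdict

def map_phase(page, outlinks):
    """Map: emit (target, 1) for every outgoing link on a crawled page."""
    return [(target, 1) for target in outlinks]

def shuffle(mapped):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(target, counts):
    """Reduce: sum the 1s to get the inbound-link count per page."""
    return target, sum(counts)

# Toy crawl data (hypothetical URLs).
crawl = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}

mapped = [pair for page, links in crawl.items() for pair in map_phase(page, links)]
inbound = dict(reduce_phase(t, c) for t, c in shuffle(mapped).items())
print(inbound)  # {'b.com': 1, 'c.com': 2, 'a.com': 1}
```

The same three-phase structure scales from this toy example to the trillion-link index simply because each phase is embarrassingly parallel.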
I think this is quite impressive, considering that Hadoop is open-source software in an early stage of development, written in Java. Could this be the real reason why Microsoft wants to buy Yahoo!? :-)

Update - added a few more links:
Jeremy Zawodny blog (Yahoo!) http://jeremy.zawodny.com/blog/archives/009992.html
Interview with Doug Cutting (InfoQ) http://www.infoq.com/articles/hadoop-interview
