Saturday, August 18, 2007

Scalability and Efficiency of Data Mining

There is a nice video presentation about the scalability and efficiency of parallel computation. It touches on the MapReduce paradigm, and a large portion of the presentation is devoted to a classical problem called Frequent Itemset Mining. Experimental results for other classical data mining tasks are presented as well.
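To make the connection between the two concrete, here is a minimal sketch of how the support-counting step of Frequent Itemset Mining can be phrased as a map and a reduce function. It is not taken from the presentation: the function names, the `min_support` parameter and the sample baskets are my own assumptions, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, k):
    """Emit (itemset, 1) for every k-item subset of one transaction."""
    for itemset in combinations(sorted(set(transaction)), k):
        yield itemset, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(itemset, counts, min_support):
    """Sum the counts and keep the itemset only if it is frequent enough."""
    total = sum(counts)
    if total >= min_support:
        yield itemset, total


def frequent_itemsets(transactions, k, min_support):
    """Run map, shuffle and reduce in-process (illustrative only)."""
    pairs = (pair for t in transactions for pair in map_phase(t, k))
    result = {}
    for itemset, counts in shuffle(pairs).items():
        for key, total in reduce_phase(itemset, counts, min_support):
            result[key] = total
    return result


# Made-up sample data, not from the presentation.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

if __name__ == "__main__":
    for itemset, support in sorted(frequent_itemsets(baskets, 2, 3).items()):
        print(itemset, support)
```

Note that each map call looks at one transaction only and each reduce call at one itemset only, which is exactly the stateless style that plain MapReduce handles well.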

Interestingly, Doug Cutting (one of the leading developers of Hadoop) has a post on his blog about how to use MapReduce to implement ten different machine learning algorithms.

If I understood correctly, one of the main points of Wagner's presentation is that current MapReduce is great for stateless computations but can be less effective when a stateful approach is needed. For their needs they created a MapReduce-derived implementation in which each Reduce phase can store results and other metadata in an external repository, so that other tasks can learn about them very quickly. A subsequent Map task can then start as soon as it has all the information it needs and does not have to wait until the whole Map phase finishes.
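The idea can be illustrated with a small sketch, which is only my reading of it and not the implementation from the presentation. The `Repository` class is a hypothetical in-memory stand-in for the external repository, threads stand in for cluster tasks, and all names are assumptions. A follow-up map task waits only for the keys it needs, so it can run while a slower reduce task is still unfinished.

```python
import threading


class Repository:
    """Hypothetical stand-in for the external store (in-memory, thread-safe)."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def publish(self, key, value):
        """Store one reduce result and wake up any waiting tasks."""
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def wait_for(self, keys, timeout=5.0):
        """Block until every requested key has been published."""
        with self._cond:
            ready = self._cond.wait_for(
                lambda: all(k in self._data for k in keys), timeout
            )
            if not ready:
                raise TimeoutError("inputs not published in time")
            return {k: self._data[k] for k in keys}

    def snapshot(self):
        """Return a copy of everything published so far."""
        with self._cond:
            return dict(self._data)


def reduce_task(repo, key, values, gate=None):
    """Sum the values for one key; an optional gate simulates a slow reducer."""
    if gate is not None:
        gate.wait(timeout=5.0)
    repo.publish(key, sum(values))


def next_map_task(repo, needed, log, done):
    """Start as soon as the needed keys exist, not when all reducers finish."""
    inputs = repo.wait_for(needed)
    log.append(("map", sorted(inputs.items())))
    done.set()


def demo():
    repo = Repository()
    log = []
    map_done = threading.Event()
    threads = [
        threading.Thread(target=next_map_task, args=(repo, ["a"], log, map_done)),
        threading.Thread(target=reduce_task, args=(repo, "a", [1, 2, 3])),
        # This reducer is held back until the map task above has finished.
        threading.Thread(target=reduce_task, args=(repo, "b", [10, 20], map_done)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    log.append(("final", sorted(repo.snapshot().items())))
    return log


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

In the demo the map task that needs only key "a" completes before the reducer for key "b" has published anything. With a strict barrier between phases it would have had to wait for both.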
