MapReduce - A way to process BigData

Web and enterprise data is growing faster with every passing second: an explosion of data. Most of this data is unstructured, and we need a way to manage it, or rather to extract useful information from it. So how big is BigData, really? 1024 GB = 1 Terabyte, 1024 Terabytes = 1 Petabyte... that's massive data, and Google processes almost 20 Petabytes of data every day. At this scale, traditional data processing techniques simply cannot keep up. We need a highly optimized data processing technique, and yes, thousands of machines to get the work done. MapReduce is the way to go for such processing needs.

Data is everywhere:
- Flickr (3 billion photos)
- YouTube (83M videos, 15 hrs of video uploaded/min)
- Web (10B videos watched / mo.)
- Digital photos (500 billion / year)
- All broadcast (70,000TB / year)
- Yahoo! Webmap (3 trillion links, 300TB compressed, 5PB disk)
- Human genome (2-30TB uncomp.)


Is MapReduce a programming model? An execution environment? A software package? It's all of these, depending on whom you ask.

The MapReduce model derives from the map and reduce functions of a functional programming language like Lisp. In Lisp, a map takes as input a function and one or more sequences of values. It then applies the function to each value in the sequence(s). A reduce combines all the elements of a sequence using a binary operation; for example, it can use + to add up all the elements, or - to subtract them successively. In Lisp, mapcar applies the function to the lists' elements in order, producing a new list of values.

(mapcar #'+ '(1 2 3 4) '(10 20 30 40)) => (11 22 33 44)

(reduce #'- '(11 22 33 44)) => (- (- (- 11 22) 33) 44) => -88

Here the subtraction is applied successively: 11 - 22 = -11, then -11 - 33 = -44, and then -44 - 44 = -88.
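
For comparison, here is the same pair of calls in Python, whose built-in map and functools.reduce mirror Lisp's mapcar and reduce:

from functools import reduce
from operator import add, sub

# map applies the function element-wise across the sequences, like mapcar
print(list(map(add, [1, 2, 3, 4], [10, 20, 30, 40])))  # [11, 22, 33, 44]

# reduce folds the binary operation over the list, left to right
print(reduce(sub, [11, 22, 33, 44]))  # ((11 - 22) - 33) - 44 = -88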

MapReduce is inspired by these Lisp concepts. It was developed for processing large amounts of raw data such as crawled documents or web logs. Since this is a massive amount of data to process (BigData), it must be distributed across thousands of machines to get results in a reasonable time. This is similar to parallel computing, since the same computation is performed on each CPU, but with a different slice of the dataset.

MapReduce is a framework invented by Google in 2004 for running applications (aka jobs) over massive datasets on huge clusters of commodity hardware, capable of processing petabytes of data. It implements the map/reduce computational paradigm from functional programming. In simple terms, it's a divide-and-conquer technique: the application data is divided into self-contained units of work, each of which may be executed independently on any node in the cluster, and that independence is the key to the Map/Reduce programming model.

MapReduce = Distributed Computation (on distributed storage, with scheduling and fault tolerance). In general, Map/Reduce has two basic steps:

"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes (computer) in the cluster. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

Why Map/Reduce? Consider the performance implications of scanning 100 TB of data at 50 MB/second (100 TB ÷ 50 MB/s = 2,000,000 seconds):
On 1 node: ~23 days, and considerably longer once downtime and failures are factored in
On a 1000-node cluster: ~33 minutes

If you are still finding it hard to get a grip on Map/Reduce, here's an excellent, innovative way to picture it: the story of Sam, who turns fruits into mixed fruit juice.

Logically, the Map function accepts a pair of data within one data domain, and returns a list of pairs in a different domain:    Map(k1,v1) → list(k2,v2)

This is similar to what Sam did: he used a knife (k1) on some fruits (v1) to produce pieces of each fruit, which correspond to list(k2,v2).

The Reduce function is then applied to each group of v2 values to produce a collection of values in the same domain:    Reduce(k2, list(v2)) → list(v3)

This is similar to what Sam did next: he applied a mixer (k2) to list(v2), the cut fruits, to create a mixed fruit juice (v3). Each Reduce call typically produces either one value v3 or an empty return (one glass of mixed fruit juice, or no juice at all).
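
A playful sketch of Sam's workflow in the Map/Reduce shape (the map_fn and reduce_fn names, and the string values, are purely illustrative):

# Map(k1, v1) -> list(k2, v2): the knife cuts each fruit into pieces,
# keyed by the mixer that will blend them.
def map_fn(knife, fruits):
    return [("mixer", f"piece of {fruit}") for fruit in fruits]

# Reduce(k2, list(v2)) -> list(v3): the mixer blends all pieces into juice.
def reduce_fn(mixer, pieces):
    return [f"juice blended from {len(pieces)} pieces"]

pairs = map_fn("knife", ["apple", "banana", "orange"])
pieces = [piece for _, piece in pairs]
print(reduce_fn("mixer", pieces))   # ['juice blended from 3 pieces']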

Let's consider another real example on a smaller set of unstructured data:
We have some comments from customers of Hotels A, B, and C. Let's try to find out the most "awesome" hotel according to the customers.

Hotel Name, Review
"Hotel C","Liked it"
"Hotel B","Awesome pool!"
"Hotel C","Awesome experience"
"Hotel B","Awesome restaurants"
"Hotel A","Loved it"
"Hotel A","Miserable experience"
"Hotel B","Boring"

The Map function will create a map of reviews for each hotel. So for Hotel B we would have something like:
"Hotel B": "Awesome pool!", "Awesome restaurants", "Boring"

The Map function has already gathered all reviews for Hotel B. Now let the Reduce function do its final job. The Reduce function would have an implementation something like: "if the review contains the word 'awesome', increment the counter for that hotel by 1".

Here's the final output: "Hotel B" tops the list with 2 'awesome' reviews, "Hotel C" has 1, and "Hotel A" has 0 'awesome' reviews. The example above uses only a small set of data, but in reality we could have thousands of such review comments for hundreds of hotels.
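
Here is a minimal, single-process sketch of that job in Python; the grouping that a real framework performs across many machines during its shuffle phase is simulated by hand:

from collections import defaultdict

reviews = [
    ("Hotel C", "Liked it"),
    ("Hotel B", "Awesome pool!"),
    ("Hotel C", "Awesome experience"),
    ("Hotel B", "Awesome restaurants"),
    ("Hotel A", "Loved it"),
    ("Hotel A", "Miserable experience"),
    ("Hotel B", "Boring"),
]

# Map: emit (hotel, review) pairs.
def map_fn(hotel, review):
    yield hotel, review

# Shuffle: group values by key (done by the framework in a real job).
grouped = defaultdict(list)
for hotel, review in reviews:
    for key, value in map_fn(hotel, review):
        grouped[key].append(value)

# Reduce: count the reviews containing 'awesome' for each hotel.
def reduce_fn(hotel, hotel_reviews):
    return sum(1 for r in hotel_reviews if "awesome" in r.lower())

for hotel in sorted(grouped):
    print(hotel, reduce_fn(hotel, grouped[hotel]))
# Hotel A 0
# Hotel B 2
# Hotel C 1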

The same technique works on any piece of unstructured text. Consider the classic word count, where we find the number of occurrences of the words 'cow' and 'dog', as sketched below.
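
A sketch in the style of Hadoop Streaming, with two hypothetical scripts, mapper.py and reducer.py; the Unix sort between them plays the role of the framework's shuffle phase:

# mapper.py: emits "word<TAB>1" for every word read from stdin.
import re
import sys

for line in sys.stdin:
    for word in re.findall(r"[a-z]+", line.lower()):
        print(f"{word}\t1")

# reducer.py: sums the counts for each word; because its input arrives
# sorted by word, all occurrences of a key are adjacent.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{total}")
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")

Run it as: cat input.txt | python3 mapper.py | sort | python3 reducer.py, then look up the output lines for cow and dog.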


Some more realistic examples that can be directly implemented using MapReduce:

Distributed Grep: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
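
A sketch of the map side in the same streaming style (the 'ERROR' pattern is just an example); the identity reducer simply copies its input through, so it needs no code of its own:

# grep_mapper.py: emits every input line that matches the pattern.
import re
import sys

PATTERN = re.compile(r"ERROR")   # hypothetical pattern to search for

for line in sys.stdin:
    if PATTERN.search(line):
        sys.stdout.write(line)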

Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1> indicating one occurrence of the URL. The reduce function then adds together all values for the same URL and emits a <URL, total count> pair.
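
This job can reuse the word-count reducer above unchanged; only the mapper differs. A sketch, assuming access logs in Common Log Format, where the requested URL is the seventh whitespace-separated field:

# url_mapper.py: emits "URL<TAB>1" for each request in an access log.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:                 # skip malformed lines
        print(f"{fields[6]}\t1")        # field 7 holds the requested URL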