215

I couldn't think of any good examples other than the "how to count words in a long text with MapReduce" task. I found this wasn't the best example to give others an impression of how powerful this tool can be.

I'm not looking for code-snippets, really just "textual" examples.

approxiblue
pagid
  • 1
    I think a similar but much better example is to count the words in all the text files you have on your computer. It's easier to understand and demonstrates the power of MapReduce. – Peter Lee Dec 08 '13 at 04:14
  • 5
    The last four questions I searched for were closed as non-constructive on this site. Fortunately, they already have answers. To the authors, I extend my gratitude, and as of now more than 80 individuals do not understand the closing policy. Not that it matters to others, but I've been a professional programmer since the start of the '80s and, by now, I find myself asking the wrong questions :) – Helder Velez Dec 06 '14 at 23:51
  • 2
    It is worth having a look at MapReduce design patterns: e.g. some are covered in [these slides](https://courses.cs.washington.edu/courses/cse490h/08au/lectures/MapReduceDesignPatterns-UW2.pdf) and more can be seen in [this book](http://shop.oreilly.com/product/0636920025122.do) – Denis Feb 22 '17 at 15:55

4 Answers

315

MapReduce is a framework that was developed to process massive amounts of data efficiently. For example, if we have a million records in a dataset stored in a relational representation, it is very expensive to derive values from them or perform any sort of transformation on them.

For example, in SQL, given the date of birth, finding out how many people are older than 30 across a million records would take a while, and the cost only increases by orders of magnitude as the complexity of the query grows. MapReduce provides a cluster-based implementation where data is processed in a distributed manner.
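To make that concrete, here is a minimal single-process sketch of that age query in MapReduce style. The data, the reference date, and the function names are all invented for illustration; a real cluster would run the map calls in parallel across machines:

```python
from collections import defaultdict
from datetime import date

def mapper(record):
    # record is (name, date_of_birth); emit ("over_30", 1) for matches.
    # The reference date is a fixed assumption for this toy example.
    name, dob = record
    age_years = (date(2024, 1, 1) - dob).days // 365
    if age_years > 30:
        yield ("over_30", 1)

def reducer(key, values):
    # Sum the 1's emitted for this key.
    return (key, sum(values))

people = [("ann", date(1980, 5, 1)),
          ("bob", date(2001, 7, 3)),
          ("cho", date(1975, 2, 9))]

groups = defaultdict(list)
for rec in people:
    for k, v in mapper(rec):
        groups[k].append(v)

result = dict(reducer(k, v) for k, v in groups.items())
print(result)  # {'over_30': 2}
```

Each mapper only needs its own record, which is what makes the scan trivially parallel.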

Here is a wikipedia article explaining what map-reduce is all about.

Another good example is Finding Friends via map reduce; it is a powerful way to understand the concept, and a well-used use case.

Personally, I found this link quite useful for understanding the concept.

Copying the explanation provided in the blog (in case the link goes stale):

Finding Friends

MapReduce is a framework originally developed at Google that allows for easy large scale distributed computing across a number of domains. Apache Hadoop is an open source implementation.

I'll gloss over the details, but it comes down to defining two functions: a map function and a reduce function. The map function takes a value and outputs key:value pairs. For instance, if we define a map function that takes a string and outputs the length of the word as the key and the word itself as the value, then map(steve) would return 5:steve and map(savannah) would return 8:savannah. You may have noticed that the map function is stateless and only requires the input value to compute its output value. This allows us to run the map function against values in parallel, and it provides a huge advantage. Before we get to the reduce function, the mapreduce framework groups all of the values together by key, so if the map functions output the following key:value pairs:

3 : the
3 : and
3 : you
4 : then
4 : what
4 : when
5 : steve
5 : where
8 : savannah
8 : research

They get grouped as:

3 : [the, and, you]
4 : [then, what, when]
5 : [steve, where]
8 : [savannah, research]

Each of these lines would then be passed as an argument to the reduce function, which accepts a key and a list of values. In this instance, we might be trying to figure out how many words of certain lengths exist, so our reduce function will just count the number of items in the list and output the key with the size of the list, like:

3 : 3
4 : 3
5 : 2
8 : 2

The reductions can also be done in parallel, again providing a huge advantage. We can then look at these final results and see that there were only two words of length 5 in our corpus, etc...
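The whole word-length example above can be sketched in a few lines of Python. This is a single-process illustration only; a real framework would run the map and reduce calls on separate machines:

```python
from collections import defaultdict

def map_word(word):
    # Emit a (key, value) pair: word length as key, the word as value.
    return (len(word), word)

def reduce_lengths(key, values):
    # Count how many words share this length.
    return (key, len(values))

words = ["the", "and", "you", "then", "what", "when",
         "steve", "where", "savannah", "research"]

# Map phase: stateless, so each call could run on a different machine.
pairs = [map_word(w) for w in words]

# Shuffle/group phase: collect all values under their key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce phase: each key's group can be reduced independently.
results = dict(reduce_lengths(k, v) for k, v in groups.items())
print(results)  # {3: 3, 4: 3, 5: 2, 8: 2}
```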

The most common example of mapreduce is for counting the number of times words occur in a corpus. Suppose you had a copy of the internet (I've been fortunate enough to have worked in such a situation), and you wanted a list of every word on the internet as well as how many times it occurred.

The way you would approach this would be to tokenize the documents you have (break them into words), and pass each word to a mapper. The mapper would then spit the word back out along with a value of 1. The grouping phase will take all the keys (in this case words), and make a list of 1's. The reduce phase then takes a key (the word) and a list (a list of 1's for every time the key appeared on the internet), and sums the list. The reducer then outputs the word, along with its count. When all is said and done you'll have a list of every word on the internet, along with how many times it appeared.
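As a sketch, with a tiny invented list of "documents" standing in for a copy of the internet:

```python
from collections import defaultdict

def mapper(document):
    # Tokenize and emit (word, 1) for every occurrence.
    for word in document.split():
        yield (word, 1)

def reducer(word, counts):
    # Sum the list of 1's for this word.
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Group phase: one list of 1's per distinct word.
groups = defaultdict(list)
for doc in documents:
    for word, one in mapper(doc):
        groups[word].append(one)

counts = dict(reducer(w, c) for w, c in groups.items())
print(counts["the"])  # 3
```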

Easy, right? If you've ever read about mapreduce, the above scenario isn't anything new... it's the "Hello, World" of mapreduce. So here is a real world use case (Facebook may or may not actually do the following, it's just an example):

Facebook has a list of friends (note that friends are a bi-directional thing on Facebook. If I'm your friend, you're mine). They also have lots of disk space and they serve hundreds of millions of requests every day. They've decided to pre-compute calculations when they can to reduce the processing time of requests. One common processing request is the "You and Joe have 230 friends in common" feature. When you visit someone's profile, you see a list of friends that you have in common. This list doesn't change frequently so it'd be wasteful to recalculate it every time you visited the profile (sure you could use a decent caching strategy, but then I wouldn't be able to continue writing about mapreduce for this problem). We're going to use mapreduce so that we can calculate everyone's common friends once a day and store those results. Later on it's just a quick lookup. We've got lots of disk, it's cheap.

Assume the friends are stored as Person->[List of Friends], our friends list is then:

A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D

Each line will be an argument to a mapper. For every friend in the list of friends, the mapper will output a key-value pair. The key will be a friend along with the person. The value will be the list of friends. The key will be sorted so that the friends are in order, causing all pairs of friends to go to the same reducer. This is hard to explain with text, so let's just do it and see if you can see the pattern. After all the mappers are done running, you'll have a list like this:

For map(A -> B C D) :

(A B) -> B C D
(A C) -> B C D
(A D) -> B C D

For map(B -> A C D E) : (Note that A comes before B in the key)

(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E

For map(C -> A B D E) :

(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E

For map(D -> A B C E) :

(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E

And finally for map(E -> B C D):

(B E) -> B C D
(C E) -> B C D
(D E) -> B C D

Before we send these key-value pairs to the reducers, we group them by their keys and get:

(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)

Each line will be passed as an argument to a reducer. The reduce function will simply intersect the lists of values and output the same key with the result of the intersection. For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D) and means that friends A and B have C and D as common friends.

The result after reduction is:

(A B) -> (C D)
(A C) -> (B D)
(A D) -> (B C)
(B C) -> (A D E)
(B D) -> (A C E)
(B E) -> (C D)
(C D) -> (A B E)
(C E) -> (B D)
(D E) -> (B C)

Now when D visits B's profile, we can quickly look up (B D) and see that they have three friends in common, (A C E).
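A single-process Python sketch of this friend-pair scheme, using the friend graph from the example (sorting the pair key is what sends (A B) and (B A) to the same reducer):

```python
from collections import defaultdict

friends = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D", "E"},
    "C": {"A", "B", "D", "E"},
    "D": {"A", "B", "C", "E"},
    "E": {"B", "C", "D"},
}

def mapper(person, friend_list):
    # For each friend, emit a sorted (person, friend) pair as the key,
    # so both sides of a friendship land on the same reducer.
    for friend in friend_list:
        key = tuple(sorted((person, friend)))
        yield (key, friend_list)

def reducer(pair, friend_lists):
    # Intersect the two friend lists: the result is the common friends.
    return (pair, set.intersection(*friend_lists))

# Shuffle/group phase: each pair key collects exactly two friend lists.
groups = defaultdict(list)
for person, flist in friends.items():
    for key, value in mapper(person, flist):
        groups[key].append(value)

common_friends = dict(reducer(k, v) for k, v in groups.items())
print(sorted(common_friends[("B", "D")]))  # ['A', 'C', 'E']
```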

KT12
karthikr
  • 4
    Another example would be analyzing weather data from across the world: finding the max and min for any given region. This is a very good example. – rvphx Feb 28 '13 at 19:55
  • Generating all of those intermediate tuples and then later checking the intersection for all, isn't that tedious? Wouldn't it be better to just generate all possible friend pairs, like AB AC BC etc and just pass these pairs with the entire friend lists, just of the two friends in the pair, to a particular machine and let it calculate the intersection? What am I missing here? – GrowinMan Mar 15 '14 at 00:36
  • @GrowinMan, I think the point is in scalability, parallelism, and size of the whole set of data. For small problems, MapReduce is not adapted, but for huge sets of data (think millions of people with hundreds of friends each). – Thomas BDX Apr 15 '14 at 19:13
  • 8
    What if A visit E's profile? There's no (A, E) in the final result though they have friends in common. – Pinch Jun 28 '14 at 22:01
  • 1
    @Pinch that's because A and E are not friends themselves. In that case this approach seems indeed insufficient (unless you take into account that A or E could hide their friendlist for non-friends:) ) – Pega88 Nov 18 '14 at 16:21
  • hello karthikr... can you tell me where the grouping happens, as you specified in your answer (3 : [the, and, you])? – Aviral Kumar Feb 05 '15 at 11:10
  • does it happen in the partitioner and sort phase? – Aviral Kumar Feb 05 '15 at 11:11
  • 1
    @karthikr : I'm confused about the grouping phase. Map and Reduce can obviously be run in parallel, but what about the grouping phase? It must be done in a single thread, or am I missing something? – Dinaiz Aug 21 '16 at 09:05
  • I don't know why people keep giving extremely simple examples to explain such a powerful tool? – Pratik Singhal Dec 27 '18 at 07:45
27

One of the best examples of a Hadoop-like MapReduce implementation.

Keep in mind, though, that they are limited to key-value based implementations of the MapReduce idea (so they are limited in applicability).

approxiblue
Nikita Ivanov
5

One set of familiar operations that you can do in MapReduce is the set of normal SQL operations: SELECT, SELECT WHERE, GROUP BY, etc.

Another good example is matrix multiply, where you pass one row of M and the entire vector x and compute one element of M * x.
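For instance, a sketch of the matrix-vector multiply with an invented 2x2 example, where each mapper call stands in for one machine holding one row of M plus all of x:

```python
def mapper(row_index, row, x):
    # Each mapper gets one row of M and the entire vector x,
    # and emits that row's element of the product M * x.
    yield (row_index, sum(m * v for m, v in zip(row, x)))

M = [[1, 2],
     [3, 4]]
x = [5, 6]

# The "reduce" here is trivial: each key carries a single value.
result = {}
for i, row in enumerate(M):
    for key, value in mapper(i, row, x):
        result[key] = value

print([result[i] for i in range(len(M))])  # [17, 39]
```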

guyrt
  • 927
  • 7
  • 12
3

From time to time I present MR concepts to people. I take processing tasks that are familiar to people and then map them onto the MR paradigm.

Usually I take two things:

  1. Group By / Aggregations. Here the advantage of the shuffling stage is clear. Explaining that shuffling is also a distributed sort, plus an explanation of a distributed sort algorithm, also helps.

  2. Join of two tables. People working with DB are familiar with the concept and its scalability problem. Show how it can be done in MR.
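
A reduce-side join of two tables can be sketched like this (toy tables and names invented; each record is tagged with its table of origin so the reducer can cross the two sides):

```python
from collections import defaultdict

# Two toy "tables", both keyed on a user id.
users  = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]

def mapper(record, source):
    # Emit the join key with the record tagged by its source table.
    key, value = record
    yield (key, (source, value))

# Shuffle/group phase: rows from both tables meet under their join key.
groups = defaultdict(list)
for rec in users:
    for k, v in mapper(rec, "users"):
        groups[k].append(v)
for rec in orders:
    for k, v in mapper(rec, "orders"):
        groups[k].append(v)

def reducer(key, tagged_values):
    # Cross the rows from each side that share this key.
    left  = [v for s, v in tagged_values if s == "users"]
    right = [v for s, v in tagged_values if s == "orders"]
    return [(key, l, r) for l in left for r in right]

joined = [row for k, vs in groups.items() for row in reducer(k, vs)]
print(joined)  # [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]
```

Scalability comes from the fact that each join key's group can be reduced on a different machine.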

approxiblue
David Gruzman
  • To explain it to non-nerds I use the children method: you have a bunch of eager kids, and many, many cards. You give each kid a stack of cards, telling them to sort them by back-of-card deck, then by number/picture, then by suit; i.e. the map function. Each kid finishes and brings their stacks to an allotted set of adults, two at a time. Each adult "reduces" the piles into one pile, and then each two adults give their card stacks to a free adult. That is by definition the reduce function, which can be run more than one time according to the number of kids/stacks. Most people get it on the first try. – Mickey Perlstein Jul 27 '16 at 13:56