
I have configured a Hadoop setup in pseudo-distributed mode. The question in short is: How do I decide which subtasks should be assigned to the mappers and which to the reducers?

Details: In the Udacity course Intro to Hadoop and MapReduce, the problem was:

Data comes from several branches around the world that belong to the same store company. Each data record stores a sale (receipt) in one of the stores. The data is of the form (date, time, store_name, cost), e.g. (2012-01-01, 12:01, New York Store, $12.99). The task was to get the total sales per store.

Udacity's solution was:

  1. Mappers only read the file line by line and pass the records on to the reducers (pretty much just reading and forwarding the file lines)
  2. Reducers collect the sorted keys (the store names) and add up the sales per store

This choice of splitting the tasks between mappers and reducers confuses me. The reducer still seems to be doing the entire job of reading and adding, and since there is only 1 reducer by default, this solution sounds like it creates a bottleneck in the reducer.

My expected Solution was:

  1. Mappers read the file: each mapper reads a subset of the sales, adds up the sales per store, and passes a list of hashes (keys: store names, values: sums of sales) to the reducers, roughly as sketched below.
  2. Reducers (by default there is only 1 reducer) get this pre-aggregated version, so their task is now much simpler.
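Roughly, the mapper I had in mind would look something like this (a rough Hadoop Streaming style sketch, assuming tab-separated fields in the order shown above; this is my own illustration, not code from the course):

    #!/usr/bin/env python
    # mapper.py - my idea: pre-aggregate inside the mapper, then emit one
    # (store_name, partial_sum) pair per store seen by this mapper.
    import sys
    from collections import defaultdict

    totals = defaultdict(float)

    for line in sys.stdin:
        fields = line.strip().split("\t")
        if len(fields) == 4:                      # date, time, store_name, cost
            _, _, store_name, cost = fields
            totals[store_name] += float(cost)

    for store_name, partial_sum in totals.items():
        print(store_name + "\t" + str(partial_sum))

That way each reducer would only have to add up a handful of partial sums per store instead of every single receipt.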

My questions are:

  1. Why did they implement it that way? Was it correct? Was my understanding of MapReduce wrong? If so, could you point me to books, videos, or tutorials that resolve this conflict?
  2. In a project with a bigger number of tasks, how would I decide which ones go to the mappers and which to the reducers? Is there any reference or metric?

1 Answer


The core idea of the MapReduce programming paradigm is that the data are represented as a number of key-value pairs, where the key identifies what a record belongs to (for example a user ID, if we want to calculate something for each user, or a word, if we want to count every occurrence of every word as shown here in the WordCount example from the Hadoop project itself).

To get from the form of the input data to our desired output, some manipulation is needed in order to:

  1. Convert the input data into one or more key-value pairs (so we Map the data), and
  2. Actually process those pairs the way we want in order to produce the output (so we Reduce the key-value pairs according to the purpose of our program)

Based on the Udacity example you included in your question, we know the form of the input data:

(date, time, store_name, cost)

We need to find the sales per store, so first we map the input data into key-value form, where the store_name of each input record becomes the key. The map function reads each line and constructs a key-value pair that stores just the information we need, like this (where store_name is the key and cost is the value):

(store_name, cost)
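In Hadoop Streaming terms (the setup the Udacity course uses), a mapper for this step could look roughly like the sketch below; the tab-separated field order is my assumption, not something taken from the course material:

    #!/usr/bin/env python
    # mapper.py - turn each input line into a (store_name, cost) pair.
    import sys

    for line in sys.stdin:
        fields = line.strip().split("\t")
        if len(fields) == 4:                      # date, time, store_name, cost
            date, time, store_name, cost = fields
            print(store_name + "\t" + cost)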

Doing that for every line of the input will produce many key-value pairs that share the same store_name as key, and based on that we can reduce those pairs by grouping them by key and adding up their values to get the total amount of sales per store. The output of that function looks like this (where store_name is the key and total_sales is the value computed inside the reduce function):

(store_name, total_sales)
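A matching streaming reducer could look like the sketch below; it relies on the fact that Hadoop hands it the pairs already sorted by key (again, a sketch rather than the course's exact code):

    #!/usr/bin/env python
    # reducer.py - input arrives sorted by key, so total up one store at a time.
    import sys

    current_store = None
    total = 0.0

    for line in sys.stdin:
        store_name, cost = line.strip().split("\t")
        if store_name != current_store:
            if current_store is not None:
                print(current_store + "\t" + str(total))   # finished one store
            current_store = store_name
            total = 0.0
        total += float(cost)

    if current_store is not None:
        print(current_store + "\t" + str(total))           # emit the last store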

As for splitting the tasks between mappers and reducers, this is almost entirely up to you; just keep in mind that mappers are executed over all of the input data (e.g. reading all records line by line in your example), while reducers are executed once for each key of the key-value pairs produced by the mappers. The good news is that frameworks like Hadoop do most of this for you (the splitting, shuffling, and sorting), so all you need to write is how to map the data in the Map function and how to process the grouped-by-key data in the Reduce function.
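If it helps, here is a tiny local simulation of that flow in plain Python (no Hadoop involved, and the records are made up for illustration) just to show who sees what: the map step touches every record, the framework's shuffle/sort groups by key, and the reduce step runs once per key:

    from itertools import groupby
    from operator import itemgetter

    records = [
        ("2012-01-01", "12:01", "New York Store", 12.99),
        ("2012-01-01", "12:05", "Miami Store", 3.50),
        ("2012-01-02", "09:30", "New York Store", 7.01),
    ]

    mapped = [(store, cost) for _, _, store, cost in records]      # Map: one pair per record
    mapped.sort(key=itemgetter(0))                                 # shuffle & sort by key
    reduced = [(store, sum(cost for _, cost in group))             # Reduce: once per key
               for store, group in groupby(mapped, key=itemgetter(0))]

    print(reduced)   # one (store_name, total_sales) pair per store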

As for the default single reducer, you can indeed set the number of reducers to however many you want; they are just instances of the Reduce function executed on different threads/CPUs/machines for different keys (and that's why Hadoop is used in Big Data applications: such enormous amounts of input data are better processed in parallel across a cluster of computers).
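If you go the Hadoop Streaming route, for example, the reducer count is just a job property (e.g. passing -D mapreduce.job.reduces=4 on the streaming command line), and in the Java API the equivalent is job.setNumReduceTasks(4).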

Your confusion about the mechanics of MapReduce is understandable, because the Map and Reduce functions used here are derived from the functional programming branch of computing, and many people are not used to that style. Consider studying more examples of simple MapReduce applications to get a better idea of their mechanics, like the WordCount example or this simple MapReduce job I wrote to answer another question, where we count each athlete's total Olympic gold wins.

As for the possibility of a bigger number of tasks, it's really up to you to decide how to cut up and distribute pieces of the input data to the mappers (you can find out more from the answers to this question) and how many reducers to launch to process your data in parallel (if you want more than the default 1 reducer, of course).
