The core idea behind the MapReduce programming paradigm is that data are represented as a number of key-value pairs, where the key identifies a specific entity (for example a user ID, in case we want to calculate something for each user, or a word, in case we want to count every occurrence of every word as shown here from the Hadoop project itself).
In order to get from the form of the input data to our desired output, there must be some kind of processing that will:
- convert the input data into one or more key-value pairs (so we Map the data), and
- actually process those pairs in the way we want, to produce our output (so we Reduce the key-value pairs based on the purpose of our program).
Based on the Udacity example in your question, we know the form of the input data:
(date, time, store_name, cost)
And we need to find the total sales per store. So at first we map the input data into key-value pairs by putting the store_name of each input record into the key. The map function reads each line and constructs a key-value pair holding the information we need, like this (where store_name is the key and cost is the value):
(store_name, cost)
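To make that concrete, here is a minimal sketch of such a map function using Hadoop's Java API. The class name SalesMapper is just a placeholder, and I'm assuming the input fields are tab-separated in the order shown above, which may differ from the actual Udacity dataset:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SalesMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: each input line is tab-separated as (date, time, store_name, cost)
        String[] fields = value.toString().split("\t");
        if (fields.length == 4) {
            String storeName = fields[2];
            double cost = Double.parseDouble(fields[3]);
            // Emit the intermediate pair (store_name, cost)
            context.write(new Text(storeName), new DoubleWritable(cost));
        }
    }
}
```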
Doing that for every line of the input will likely produce many key-value pairs that share the same store_name as their key. Based on that, we can reduce those pairs by grouping them by key and adding up their values to get the total amount of sales per store. The output of that function would look like this (where store_name is the key, and total_sales is the value calculated inside the function by the user):
(store_name, total_sales)
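A matching reduce function could look roughly like this (again just a sketch; the SalesReducer name is a placeholder):

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SalesReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        // All costs for the same store_name arrive here grouped under one key
        double totalSales = 0.0;
        for (DoubleWritable cost : values) {
            totalSales += cost.get();
        }
        // Emit the final pair (store_name, total_sales)
        context.write(key, new DoubleWritable(totalSales));
    }
}
```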
As for the task splitting between mappers and reducers, this is largely up to you, but remember that the mappers are executed over all of the input data (e.g. reading every record line by line in your example), while the reducers are executed once for each key of the key-value pairs produced by the mappers. The good news is that projects like Hadoop semi-automate this for you, so all you need to define is how to map the data in the Map function and how to process the grouped-by-key data in the Reduce function.
As for the default single reducer, you may know that you can indeed set the number of reducers to whatever you want, but they will simply be instances of the Reduce function executed on different threads/CPUs/machines for different data (and that's why Hadoop is used in Big Data applications: these enormous amounts of input data are better processed in parallel across a cluster of computers).
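For example, with the Java API the driver is where that count is set; the sketch below reuses the placeholder class names from above, and picking 4 reducers is an arbitrary choice for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesPerStore {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sales per store");
        job.setJarByClass(SalesPerStore.class);
        job.setMapperClass(SalesMapper.class);
        job.setReducerClass(SalesReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        // The number of reducers defaults to 1; raise it to process keys in parallel
        job.setNumReduceTasks(4);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```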
Your confusion about the mechanics of MapReduce is understandable, because the Map and Reduce functions used here are derived from the functional programming branch of computing and many people are not used to it. Consider studying more examples of simple MapReduce applications to get a better feel for their mechanics, like the WordCount example or this simple MapReduce job I wrote to answer another question, where we count each athlete's total Olympic Gold wins.
As for the possibility of a bigger number of tasks, it's really up to you to decide how to split and distribute pieces of the input data to the mappers (you can find out more from the answers in this question) and how many reducers to initialize to process your data in parallel (for more than the default 1 reducer, of course).