Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets across a large number of nodes, applicable to certain kinds of distributable problems.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of reducers can perform the reduce phase in parallel; all that is required is that every map output sharing the same key is presented to the same reducer at the same time. While this process can appear inefficient compared to more sequential algorithms, MapReduce can be applied to far larger datasets than a single commodity server could handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
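The two steps above can be illustrated with the canonical word-count example. This is a minimal single-process sketch of the model, not a distributed implementation; the function names are illustrative only:

```python
from collections import defaultdict

# "Map" step: each input line is split into (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle: group intermediate values by key, so that every value
# for a given word reaches the same reducer.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# "Reduce" step: merge all values for each key into the final answer.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # -> 2
```

In a real cluster the map calls run on many workers in parallel and the shuffle moves data across the network; the data flow, however, is exactly this map → group-by-key → reduce pipeline.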

12151 questions
280 votes · 10 answers

Java8: HashMap to HashMap using Stream / Map-Reduce / Collector

I know how to "transform" a simple Java List from Y -> Z, i.e.: List<String> x; List<Integer> y = x.stream().map(s -> Integer.parseInt(s)).collect(Collectors.toList()); Now I'd like to do basically the same with a Map,…
Benjamin M
239 votes · 3 answers

Map and Reduce in .NET

What scenarios would warrant the use of the "Map and Reduce" algorithm? Is there a .NET implementation of this algorithm?
Developer
215 votes · 4 answers

Good MapReduce examples

I couldn't think of any good examples other than the "how to count words in a long text with MapReduce" task. I found this wasn't the best example to give others an impression of how powerful this tool can be. I'm not looking for code-snippets,…
pagid
180 votes · 8 answers

Simple explanation of MapReduce?

Related to my CouchDB question. Can anyone explain MapReduce in terms a numbnuts could understand?
reefnet_alex
142 votes · 8 answers

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In MapReduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair. What purpose do the shuffling and sorting phases serve?
Nithin
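The shuffle and sort this question asks about can be shown in miniature: sorting the intermediate pairs by key brings equal keys together, so a single linear pass can hand each key's complete set of values to the reducer. This is a simplified sketch of the idea, not Hadoop's actual implementation:

```python
from itertools import groupby
from operator import itemgetter

# Intermediate (key, value) pairs as emitted by several mappers,
# arriving in arbitrary order.
pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]

# Sorting by key makes each key's values contiguous, so groupby
# can feed every value for one key to a single reduce call.
pairs.sort(key=itemgetter(0))
reduced = {k: sum(v for _, v in g) for k, g in groupby(pairs, key=itemgetter(0))}
print(reduced)  # -> {'a': 5, 'b': 5}
```

Without the sort, a reducer could not know when it had seen the last value for a key short of buffering everything in memory; the sorted, grouped stream is what lets reducers process one key at a time.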
127 votes · 14 answers

Chaining multiple MapReduce jobs in Hadoop

In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps, i.e. Map1, Reduce1, Map2, Reduce2, and so on. So you have the output from the last reduce that is needed as the input for the next…
Niels Basjes
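The pattern in this question, the output of one reduce feeding the next map, can be sketched as a pipeline of job functions. This is a hypothetical single-process sketch; in Hadoop itself, jobs are chained by pointing one job's input path at the previous job's output path, or managed with JobControl:

```python
def job1(words):
    # Map1 + Reduce1: count occurrences per word.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts.items()

def job2(records):
    # Map2 + Reduce2: invert the result, grouping words by their count.
    # Its input is exactly job1's reduce output.
    by_count = {}
    for word, n in records:
        by_count.setdefault(n, []).append(word)
    return by_count

# Output of the last reduce is the input of the next map.
result = job2(job1(["a", "b", "a", "c", "b", "a"]))
print(result)  # -> {3: ['a'], 2: ['b'], 1: ['c']}
```

Each stage only needs to agree with its neighbour on the intermediate key/value format, which is why multi-step MapReduce pipelines compose so naturally.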
125 votes · 6 answers

How does Hadoop process records split across block boundaries?

According to the Hadoop - The Definitive Guide The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS boundaries more often than…
Praveen Sripati
123 votes · 11 answers

Can Apache Spark run without Hadoop?

Are there any dependencies between Spark and Hadoop? If not, are there any features I'll miss when I run Spark without Hadoop?
tourist
121 votes · 4 answers

How does the MapReduce sort algorithm work?

One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment. To me sorting simply involves…
Niels Basjes
116 votes · 11 answers

Does MongoDB's $in clause guarantee order?

When using MongoDB's $in clause, does the order of the returned documents always correspond to the order of the array argument?
100 votes · 15 answers

Is there a .NET equivalent to Apache Hadoop?

So, I've been looking at Hadoop with keen interest, and to be honest I'm fascinated, things don't get much cooler. My only minor issue is I'm a C# developer and it's in Java. It's not that I don't understand the Java as much as I'm looking for the…
danswain
95 votes · 9 answers

Container is running beyond memory limits

In Hadoop v1, I assigned each of 7 mapper and reducer slots a size of 1GB, and my mappers and reducers run fine. My machine has 8GB memory and 8 processors. Now with YARN, when I run the same application on the same machine, I get a container error. By…
Lishu
85 votes · 4 answers

What is Map/Reduce?

I hear a lot about map/reduce, especially in the context of Google's massively parallel compute system. What exactly is it?
Lawrence Dol
85 votes · 8 answers

When do reduce tasks start in Hadoop?

In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?
Slayer
81 votes · 3 answers

MongoDB Stored Procedure Equivalent

I have a large CSV file containing a list of stores, in which one of the field is ZipCode. I have a separate MongoDB database called ZipCodes, which stores the latitude and longitude for any given zip code. In SQL Server, I would execute a stored…
Abe