Questions tagged [mapreduce]

MapReduce is a programming model for processing huge datasets across a large number of nodes, applicable to certain kinds of distributable problems.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that it allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice parallelism is limited by the data source and/or the number of CPUs near that data. Similarly, a set of reducers can perform the reduce phase in parallel; all that is required is that every map output sharing the same key is presented to the same reducer at the same time. While this process can appear inefficient compared to more sequential algorithms, MapReduce can be applied to far larger datasets than a single commodity server could handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled, assuming the input data is still available.

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output - the answer to the problem it was originally trying to solve.
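The two steps above can be illustrated with the canonical word-count example. This is a minimal single-process sketch of the model, not a distributed implementation; the function names are illustrative only:

```python
from collections import defaultdict

# "Map" step: each input line is split into (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle: group intermediate values by key, so that every value
# for a given word reaches the same reducer.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# "Reduce" step: merge all values for each key into the final answer.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # -> 2
```

In a real cluster the map calls run on many workers in parallel and the shuffle moves data across the network; the data flow, however, is exactly this map → group-by-key → reduce pipeline.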

12151 questions
280 votes · 10 answers

Java8: HashMap to HashMap using Stream / Map-Reduce / Collector

I know how to "transform" a simple Java List from Y -> Z, i.e.: List<String> x; List<Integer> y = x.stream().map(s -> Integer.parseInt(s)).collect(Collectors.toList()); Now I'd like to do basically the same with a Map,…
Benjamin M
239 votes · 3 answers

Map and Reduce in .NET

What scenarios would warrant the use of the "Map and Reduce" algorithm? Is there a .NET implementation of this algorithm?
Developer
215 votes · 4 answers

Good MapReduce examples

I couldn't think of any good examples other than the "how to count words in a long text with MapReduce" task. I found this wasn't the best example to give others an impression of how powerful this tool can be. I'm not looking for code-snippets,…
pagid
180 votes · 8 answers

Simple explanation of MapReduce?

Related to my CouchDB question. Can anyone explain MapReduce in terms a numbnuts could understand?
reefnet_alex
142 votes · 8 answers

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In MapReduce programming the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair. What purpose do the shuffling and sorting phases serve?
Nithin
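The shuffle and sort this question asks about can be shown in miniature: sorting the intermediate pairs by key brings equal keys together, so a single linear pass can hand each key's complete set of values to the reducer. This is a simplified sketch of the idea, not Hadoop's actual implementation:

```python
from itertools import groupby
from operator import itemgetter

# Intermediate (key, value) pairs as emitted by several mappers,
# arriving in arbitrary order.
pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]

# Sorting by key makes each key's values contiguous, so groupby
# can feed every value for one key to a single reduce call.
pairs.sort(key=itemgetter(0))
reduced = {k: sum(v for _, v in g) for k, g in groupby(pairs, key=itemgetter(0))}
print(reduced)  # -> {'a': 5, 'b': 5}
```

Without the sort, a reducer could not know when it had seen the last value for a key short of buffering everything in memory; the sorted, grouped stream is what lets reducers process one key at a time.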
127 votes · 14 answers

Chaining multiple MapReduce jobs in Hadoop

In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps, i.e. Map1, Reduce1, Map2, Reduce2, and so on. So you have the output from the last reduce that is needed as the input for the next…
Niels Basjes
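The pattern in this question, the output of one reduce feeding the next map, can be sketched as a pipeline of job functions. This is a hypothetical single-process sketch; in Hadoop itself, jobs are chained by pointing one job's input path at the previous job's output path, or managed with JobControl:

```python
def job1(words):
    # Map1 + Reduce1: count occurrences per word.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts.items()

def job2(records):
    # Map2 + Reduce2: invert the result, grouping words by their count.
    # Its input is exactly job1's reduce output.
    by_count = {}
    for word, n in records:
        by_count.setdefault(n, []).append(word)
    return by_count

# Output of the last reduce is the input of the next map.
result = job2(job1(["a", "b", "a", "c", "b", "a"]))
print(result)  # -> {3: ['a'], 2: ['b'], 1: ['c']}
```

Each stage only needs to agree with its neighbour on the intermediate key/value format, which is why multi-step MapReduce pipelines compose so naturally.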
125 votes · 6 answers

How does Hadoop process records split across block boundaries?

According to the Hadoop - The Definitive Guide The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS boundaries more often than…
Praveen Sripati
123 votes · 11 answers

Can Apache Spark run without Hadoop?

Are there any dependencies between Spark and Hadoop? If not, are there any features I'll miss when I run Spark without Hadoop?
tourist
121 votes · 4 answers

How does the MapReduce sort algorithm work?

One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment. To me sorting simply involves…
Niels Basjes
116 votes · 11 answers

Does MongoDB's $in clause guarantee order?

When using MongoDB's $in clause, does the order of the returned documents always correspond to the order of the array argument?
100 votes · 15 answers

Is there a .NET equivalent to Apache Hadoop?

So, I've been looking at Hadoop with keen interest, and to be honest I'm fascinated, things don't get much cooler. My only minor issue is I'm a C# developer and it's in Java. It's not that I don't understand the Java as much as I'm looking for the…
danswain
95 votes · 9 answers

Container is running beyond memory limits

In Hadoop v1, I assigned each of 7 mapper and reducer slots a size of 1GB, and my mappers and reducers run fine. My machine has 8GB memory and 8 processors. Now with YARN, when I run the same application on the same machine, I get a container error. By…
Lishu
85 votes · 4 answers

What is Map/Reduce?

I hear a lot about map/reduce, especially in the context of Google's massively parallel compute system. What exactly is it?
Lawrence Dol
85 votes · 8 answers

When do reduce tasks start in Hadoop?

In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?
Slayer
81 votes · 3 answers

MongoDB Stored Procedure Equivalent

I have a large CSV file containing a list of stores, in which one of the field is ZipCode. I have a separate MongoDB database called ZipCodes, which stores the latitude and longitude for any given zip code. In SQL Server, I would execute a stored…
Abe