Hadoop map only job

Question

My situation is as follows:

I have two MapReduce jobs. The first is a full MapReduce job that produces output sorted by key.

The second is a map-only job that extracts part of the data and simply collects it.

The second job has no reducer.

The problem is that I am not sure whether the output of the map-only job will stay sorted, or whether it will be shuffled after the map function.

score 1 · Answer 1 · edited May 23 '17 at 12:11

First of all: if your second job only contains a filter to include/exclude specific records, then you are better off simply adding this filter to the end of the reducer of the first job.

A rather important fact about MapReduce is that the reducer will sort the records in "some way" that you do not control. When writing a job, you should assume the records are output in a random order.

If you really need all records to be output in a specific order, then using the SecondarySort mechanism in combination with a single reducer is the "easy" solution, but it doesn't scale well. The "hard" solution is what the "TeraSort" benchmark uses. Read this SO question for more insight into how that works: How does the MapReduce sort algorithm work?
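
To give a rough feel for the "hard" solution, here is a minimal sketch of range (total-order) partitioning in plain Java. It is a toy model, not Hadoop API code: the class name, method names and the split points `"h"` and `"p"` are all made up for illustration, and in a real job the split points would come from sampling the input. The idea is that every key in range i goes to reducer i, each reducer sorts its own share, and the reducer outputs read in reducer order form one globally sorted sequence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of total-order (range) partitioning; not Hadoop API code.
class RangePartitionSketch {

    // Index of the reducer responsible for `key`:
    // the number of split points that are <= key.
    static int partitionFor(String key, String[] splitPoints) {
        int idx = 0;
        while (idx < splitPoints.length && key.compareTo(splitPoints[idx]) >= 0) {
            idx++;
        }
        return idx;
    }

    // Routes keys to reducers by range, sorts inside each reducer,
    // then concatenates the reducer outputs in reducer order.
    static List<String> totalOrderSort(List<String> keys, String[] splitPoints) {
        List<List<String>> reducers = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            reducers.add(new ArrayList<>());
        }
        for (String key : keys) {
            reducers.get(partitionFor(key, splitPoints)).add(key);
        }
        List<String> result = new ArrayList<>();
        for (List<String> reducerInput : reducers) {
            Collections.sort(reducerInput); // stands in for the per-reducer sort
            result.addAll(reducerInput);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical split points; a real job would derive them by sampling.
        String[] splitPoints = {"h", "p"};
        List<String> keys = Arrays.asList("pear", "apple", "zebra", "kiwi", "grape", "mango");
        System.out.println(totalOrderSort(keys, splitPoints));
    }
}
```

Because no reducer ever holds a key that belongs in another reducer's range, the work is spread over many reducers while the combined output stays ordered, which is why this scales better than a single reducer.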

edited May 23 '17 at 12:11

Community


answered Sep 02 '13 at 18:21

Niels Basjes


It is a bit more complicated than this. The first job reads Avro files and filters some records, then produces Avro files again. The second job just reads the output of the first job and converts it to text format. So it seems that the output of the second job will be sorted, because the input is sorted? – Georgi Sep 03 '13 at 07:47

Why can't the reducer of the first job output the info in text format immediately? You can even output the data, filtered and unfiltered, in several formats directly from the reducer using the multipleoutputformats. – Niels Basjes Sep 03 '13 at 08:01
Well, that is exactly what I want to do. Unfortunately I have some trouble with Avro. Maybe I don't know how to use it properly. What I tried is: in the reducer, `collector.collect("Some text")`; in the job conf, `AvroJob.setOutputSchema(conf,Schema.create(Type.STRING));` and it complains that it is not a Pair schema. – Georgi Sep 03 '13 at 08:18

I suggest you create a new question about this problem. – Niels Basjes Sep 03 '13 at 19:57

score 0 · Answer 2 · answered Sep 02 '13 at 14:53


No. As zsxwing said, no further processing is done unless you specify a reducer. Only when there is a reducer is partitioning performed on the map side, with sorting and grouping done on the reduce side.
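
In a real Hadoop job, a map-only job is one where the number of reduce tasks is set to zero (for example `job.setNumReduceTasks(0)` in the driver). In that case the map output is written directly to the output files without being sorted. The plain Java sketch below is only a toy model of that behaviour, not Hadoop API code. The class name, method names and sample records are made up for illustration. It shows that each map task emits its records in the order it read them, so a filter keeps the relative order within each input split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model of a map-only job (zero reducers); not Hadoop API code.
class MapOnlyOrderSketch {

    // One "map task": reads its split in order and emits the kept records directly.
    static List<String> runMapTask(List<String> split, Predicate<String> keep) {
        List<String> output = new ArrayList<>();
        for (String record : split) {
            if (keep.test(record)) {
                output.add(record); // written straight out: no partition, sort or shuffle
            }
        }
        return output;
    }

    // One output list per input split, mirroring one output file per map task.
    static List<List<String>> runMapOnlyJob(List<List<String>> splits, Predicate<String> keep) {
        List<List<String>> outputs = new ArrayList<>();
        for (List<String> split : splits) {
            outputs.add(runMapTask(split, keep));
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Hypothetical sorted input, already divided into two splits.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a1", "a2", "b1"),
                Arrays.asList("c1", "c2"));
        // Hypothetical filter: drop records ending in "2".
        System.out.println(runMapOnlyJob(splits, r -> !r.endsWith("2")));
    }
}
```

Note that this only says something about the order inside each map task's output. If a sorted input file is divided into several splits, the result is several output files, and reading them back in the right sequence is a separate concern.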

answered Sep 02 '13 at 14:53

twid

