I want to split the output of the job by date (key=date, value=big_json).
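For context, the mapper emits one date-keyed record per input line. A minimal sketch of such a streaming mapper (the tab-separated "timestamp, then payload" input layout here is an assumption, not the real timestamp-and-json code):

```python
import json
import sys
from datetime import datetime, timezone

def map_line(line):
    """Turn one raw log line into a tab-separated (date, json) pair.

    Assumes the line starts with a unix timestamp followed by a tab and
    the payload -- the real input format of timestamp-and-json may differ.
    """
    ts, _, payload = line.rstrip("\n").partition("\t")
    date = datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%Y-%m-%d")
    # Hadoop streaming splits key and value on the first tab.
    return "%s\t%s" % (date, json.dumps({"raw": payload}))

if __name__ == "__main__":
    for line in sys.stdin:
        print(map_line(line))
```

With this key layout, everything for one date sorts to the same reducer group, which is what the per-date output split relies on.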
In Hadoop 1 I had a custom Java class that inherited from MultipleTextOutputFormat. As far as I know, this is deprecated in Hadoop 2.
The documentation points out that:
Use in conjuction with org.apache.hadoop.mapreduce.lib.output.MultipleOutputs to recreate the behaviour of org.apache.hadoop.mapred.lib.MultipleTextOutputFormat (etc) of the old Hadoop API.
But I don't really understand how to use it in my streaming job. What parameters should I pass?
hadoop jar /usr/local/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.4.4.jar \
    -D mapred.job.name=split-parsed-logs \
    -D mapred.reduce.tasks=140 \
    -D mapred.task.timeout=10000000 \
    -mapper "python -m timestamp-and-json" \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -input /tmp/parsed_logs \
    -output /tmp/splitted_logs \
    -file /home/user/app.mod \
    -cmdenv PYTHONPATH=app.mod \
    -outputformat org.apache.hadoop.mapreduce.lib.output.MultipleOutputs