
I want to split the output of the job by date (key=date, value=big_json).

In Hadoop 1 I had a special Java class that inherited from MultipleTextOutputFormat. As far as I know, this is deprecated in Hadoop 2.

The documentation points out that

Use in conjunction with org.apache.hadoop.mapreduce.lib.output.MultipleOutputs to recreate the behaviour of org.apache.hadoop.mapred.lib.MultipleTextOutputFormat (etc) of the old Hadoop API.

But I don't really understand how to use it in my script. What params should I use?

    hadoop jar /usr/local/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.4.4.jar \
        -D mapred.job.name=split-parsed-logs \
        -D mapred.reduce.tasks=140 \
        -D mapred.task.timeout=10000000 \
        -mapper "python -m timestamp-and-json" \
        -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
        -input /tmp/parsed_logs \
        -output /tmp/splitted_logs \
        -file /home/user/app.mod \
        -cmdenv PYTHONPATH=app.mod \
        -outputformat org.apache.hadoop.mapreduce.lib.output.MultipleOutputs

  • Try looking at this: http://stackoverflow.com/questions/15100621/multipletextoutputformat-alternative-in-new-api – Avihoo Mamka Sep 27 '15 at 19:17
  • Didn't find how to use it in streaming though, I guess I'll have to write a Java class for that job. – Alex Sep 29 '15 at 09:00

0 Answers