I want to split the output of the job by date (key=date, value=big_json).
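For context, the mapper emits one date-keyed record per input line. A minimal sketch of such a streaming mapper (the tab-separated "timestamp, then payload" input layout here is an assumption, not the real timestamp-and-json code):

```python
import json
import sys
from datetime import datetime, timezone

def map_line(line):
    """Turn one raw log line into a tab-separated (date, json) pair.

    Assumes the line starts with a unix timestamp followed by a tab and
    the payload -- the real input format of timestamp-and-json may differ.
    """
    ts, _, payload = line.rstrip("\n").partition("\t")
    date = datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%Y-%m-%d")
    # Hadoop streaming splits key and value on the first tab.
    return "%s\t%s" % (date, json.dumps({"raw": payload}))

if __name__ == "__main__":
    for line in sys.stdin:
        print(map_line(line))
```

With this key layout, everything for one date sorts to the same reducer group, which is what the per-date output split relies on.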
In Hadoop 1 I had a custom Java class that inherited from MultipleTextOutputFormat. As far as I know, this is deprecated in Hadoop 2.
The documentation points out that:
Use in conjuction with org.apache.hadoop.mapreduce.lib.output.MultipleOutputs to recreate the behaviour of org.apache.hadoop.mapred.lib.MultipleTextOutputFormat (etc) of the old Hadoop API.
But I don't really understand how to use it in my streaming job. What parameters should I pass?
hadoop jar /usr/local/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.4.4.jar \
    -D mapred.job.name=split-parsed-logs \
    -D mapred.reduce.tasks=140 \
    -D mapred.task.timeout=10000000 \
    -mapper "python -m timestamp-and-json" \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -input /tmp/parsed_logs \
    -output /tmp/splitted_logs \
    -file /home/user/app.mod \
    -cmdenv PYTHONPATH=app.mod \
    -outputformat org.apache.hadoop.mapreduce.lib.output.MultipleOutputs