Specify Hadoop process split

Question

I want to run Hadoop MapReduce on a small part of my text file.

One of my tasks is failing. In the log I can read:

Processing split: hdfs://localhost:8020/user/martin/history/history.xml:3556769792+67108864

Can I run MapReduce again on just this part of the file, from offset 3556769792 to 3623878656 (3556769792+67108864)?

score 2 · Answer 1 · answered Sep 17 '13 at 05:19

One way to do this is to copy the bytes of the failing split out of the file, starting at the offset from the log, and add them back into HDFS as a separate file. Then run the MapReduce job on that file only.

1) Copy 67108864 bytes starting at offset 3556769792 (history.xml here is a copy of the file on the local filesystem):

dd if=history.xml bs=1 skip=3556769792 count=67108864 > history_offset.xml

2) Import the extracted file into HDFS:

hadoop fs -copyFromLocal history_offset.xml offset/history_offset.xml

3) Run MapReduce again on the new directory:

hadoop jar myJar.jar 'offset' 'offset_output'
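
The extraction in step 1 can be wrapped in a small helper. This is only a sketch: extract_split is a hypothetical function name, not a Hadoop or dd feature. It assumes the file is already on the local filesystem. With bs=1, dd copies one byte at a time, which is slow for a 64 MB range. Here the offset is an exact multiple of the split length (3556769792 = 53 x 67108864), so the helper can read the range as a single block and only falls back to bs=1 when the offset is not aligned.

```shell
#!/bin/sh
# extract_split SRC OFFSET LENGTH DEST
# Hypothetical helper: copies LENGTH bytes of SRC, starting at byte OFFSET, into DEST.
extract_split() {
    src=$1
    offset=$2
    length=$3
    dest=$4
    if [ $((offset % length)) -eq 0 ]; then
        # Offset is a whole number of splits: skip that many blocks
        # of size LENGTH and read exactly one block.
        dd if="$src" of="$dest" bs="$length" skip=$((offset / length)) count=1 2>/dev/null
    else
        # Unaligned offset: byte-granular copy, correct but slow.
        dd if="$src" of="$dest" bs=1 skip="$offset" count="$length" 2>/dev/null
    fi
}
```

For the split in the question, the call would be extract_split history.xml 3556769792 67108864 history_offset.xml, followed by steps 2 and 3 unchanged. Because the split offsets are plain byte positions, the extracted file may begin or end in the middle of a record, so the job may see a truncated first or last record.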
