I've split a big binary file into 2 GB chunks and uploaded them to Amazon S3. Now I want to join it back into one file and process it with my custom
I've tried to run
elastic-mapreduce -j $JOBID -ssh \
"hadoop dfs -cat s3n://bucket/dir/in/* > s3n://bucket/dir/outfile"
but it failed because -cat writes its output to my local terminal; the redirection does not happen remotely...
How can I do this?
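As I understand it, the problem is shell semantics: the `>` redirect is handled by the shell on whichever machine runs the command, and it can only target an ordinary filesystem path, never an s3n:// URL. A minimal local illustration (the paths here are hypothetical, just to show the mechanism):

```shell
# '>' is interpreted by the shell as "create/truncate this filesystem path";
# the shell has no knowledge of Hadoop or S3 URL schemes, so the bytes
# never reach S3 -- they land on the local filesystem of the shell.
echo "chunk contents" > /tmp/joined_example
cat /tmp/joined_example   # prints: chunk contents
```

So the redirect target `s3n://bucket/dir/outfile` is treated as a literal local path, not an S3 object.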
P.S. I've tried running cat as a streaming MR job:
den@aws:~$ elastic-mapreduce --create --stream --input s3n://bucket/dir/in \
--output s3n://bucket/dir/out --mapper /bin/cat --reducer NONE
The job finished successfully. But: I had 3 file parts in dir/in, and now I have 6 parts in /dir/out:
part-0000
part-0001
part-0002
part-0003
part-0004
part-0005
And, of course, a _SUCCESS file, which is not part of my output...
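If I understand Hadoop correctly, the doubling of part files would be consistent with how map tasks are sized: each mapper processes one input *split*, not one input file, and each mapper writes its own part-NNNNN. A rough sketch of the arithmetic (the split size here is a hypothetical example; the real value depends on cluster configuration):

```shell
# Hypothetical numbers: with a 1 GB split size, each 2 GB input file
# would be divided into 2 map tasks, each emitting its own part file.
SPLIT_SIZE=$((1024 * 1024 * 1024))       # 1 GB, assumed for illustration
FILE_SIZE=$((2 * 1024 * 1024 * 1024))    # my 2 GB chunks
echo $(( (FILE_SIZE + SPLIT_SIZE - 1) / SPLIT_SIZE ))   # map tasks per file: 2
```

That would explain 3 input files turning into 6 output parts, but it still leaves me with more pieces, not one joined file.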
So, how do I join the file that was split before upload?