
I am using Amazon EMR, and because of the way it works (in parallel) my output gets split into multiple files.

But I would like to have one file instead, in the right sequence. Is it possible to do just that?

The last lines in my reducer look like this:

# doc_dict maps each key to an inner dict; emit the key, then each
# inner key/value pair on its own line (Python 2 print/iteritems)
for key, value in doc_dict.iteritems():
    print key
    for k, v in value.iteritems():
        print k, v

This is driving me crazy; I can't present the results because they are mixed up.

Petros Kyriakou
  • What is the shell command you're running to submit the job? Are you using `hadoop-streaming`? – maxymoo May 13 '16 at 01:16
  • @maxymoo I am using the Ruby AWS SDK, and yes, it's `hadoop-streaming` – Petros Kyriakou May 13 '16 at 01:17
  • You could probably limit the number of reducers to 1 by setting `mapreduce.job.reduces`. See also: [Setting the number of map tasks and reduce tasks](https://stackoverflow.com/questions/6885441/setting-the-number-of-map-tasks-and-reduce-tasks) – John Rotenstein May 13 '16 at 05:03
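
For reference, with `hadoop-streaming` that property is passed as a generic `-D` option before the streaming-specific options, and a single reducer writes a single `part-00000` file. A minimal sketch (the jar path, the input/output paths, and the script names are placeholders and will differ per EMR release and setup):

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -D mapreduce.job.reduces=1 \
        -files mapper.py,reducer.py \
        -mapper mapper.py \
        -reducer reducer.py \
        -input /input/dir/on/hdfs/ \
        -output /output/dir/on/hdfs/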

1 Answer


You have to run a command to merge the part files:

hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

Or you could write the results to an external database from within your reducers and then pull them out of that. For one project I did, I found HBase very useful for this.

Mark Giaconia
  • I should probably say that I am writing to an S3 bucket. Does this work the same? – Petros Kyriakou May 13 '16 at 01:18
  • Hmmm... I don't know about that, and I doubt it. I assumed you were writing to native HDFS. – Mark Giaconia May 13 '16 at 01:20
  • Petros, don't write to an S3 bucket; write to HDFS, then you can upload after doing the `getmerge` – maxymoo May 13 '16 at 01:21
  • Hmm, the thing is I need to run that command through the GUI for now and later through the AWS SDK. I assume that command is executed using the AWS CLI? – Petros Kyriakou May 13 '16 at 01:24
  • Surely the Ruby AWS SDK has a way to upload to S3; that's one of the core features of AWS – maxymoo May 13 '16 at 01:28
  • My workflow is like this: from a Ruby app I deploy an EMR cluster and add a step; that step gets executed and the results are saved directly to an S3 bucket. I do not know how to configure native HDFS output through the SDK. – Petros Kyriakou May 13 '16 at 01:34
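
Putting maxymoo's suggestion together: point the step's output at HDFS, merge the part files on the master node, and then upload the single file to S3. A rough sketch, assuming the AWS CLI is available on the cluster (the paths and the bucket name are placeholders):

    # merge the HDFS part files into one local file, then push it to S3
    hadoop fs -getmerge /output/dir/on/hdfs/ /tmp/merged_output.txt
    aws s3 cp /tmp/merged_output.txt s3://my-bucket/results/merged_output.txt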