
I'm having an issue where my Hadoop job on AWS EMR is not being saved to S3. When I run the job on a smaller sample, it stores the output just fine. When I run the same command on my full dataset, the job again completes, but nothing exists on S3 at the output location I specified.
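For reference, my driver sets the output path roughly like this (the class names, job name, and paths below are placeholders, not my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Simplified driver sketch -- bucket, directories, and job name are placeholders.
public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my-emr-job");
        job.setJarByClass(MyJobDriver.class);

        // job.setMapperClass(...) and job.setReducerClass(...) omitted here.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Both input and output live on S3; EMR maps the s3:// scheme to its native S3 filesystem.
        FileInputFormat.addInputPath(job, new Path("s3://myS3Bucket/input/"));
        FileOutputFormat.setOutputPath(job, new Path("s3://myS3Bucket/output/myOutputDirFinal/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}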

Apparently there was a bug with AWS EMR in 2009, but it was "fixed".

Anyone else ever have this problem? I still have my cluster online, hoping that the data is buried on the servers somewhere. If anyone has an idea where I can find this data, please let me know!

Update: When I look at the logs from one of the reducers, everything looks fine:

2012-06-23 11:09:04,437 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Creating new file 's3://myS3Bucket/output/myOutputDirFinal/part-00000' in S3
2012-06-23 11:09:04,439 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'output/myOutputDirFinal/part-00000' writing to tempfile '/mnt1/var/lib/hadoop/s3/output-3834156726628058755.tmp'
2012-06-23 11:50:26,706 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'output/myOutputDirFinal/part-00000' is being closed, beginning upload.
2012-06-23 11:50:26,958 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Outputstream for key 'output/myOutputDirFinal/part-00000' upload complete
2012-06-23 11:50:27,328 INFO org.apache.hadoop.mapred.Task (main): Task:attempt_201206230638_0001_r_000000_0 is done. And is in the process of commiting
2012-06-23 11:50:29,927 INFO org.apache.hadoop.mapred.Task (main): Task 'attempt_201206230638_0001_r_000000_0' done.

When I connect to this task's node, the temp directory mentioned is empty.

Update 2: After reading Difference between Amazon S3 and S3n in Hadoop, I'm wondering if my problem is using "s3://" instead of "s3n://" as my output path. In both my small sample (which stores fine) and my full job, I used "s3://". Any thoughts on whether this could be my problem?

Update 3: I see now that on AWS's EMR, s3:// and s3n:// both map to the S3 native file system (AWS EMR documentation).

Update 4: I re-ran this job two more times, each time increasing the number of servers and reducers. The first of these two finished with 89/90 reducer outputs copied to S3. The 90th's logs said it copied successfully, but AWS Support says the file is not there. They've escalated this problem to their engineering team. My second run, with even more reducers and servers, actually finished with all data copied to S3 (thankfully!). One oddity, though, is that some reducers take FOREVER to copy their data to S3 -- in both of these new runs, there was a reducer whose output took 1 or 2 hours to copy to S3, whereas the other reducers took 10 minutes at most (files are 3GB or so). I think this relates to something wrong with the S3NativeFileSystem used by EMR (e.g. the long hangs, which I'm getting billed for, of course, and the allegedly successful uploads that never appear). I'd upload to local HDFS first and then to S3, but I was having issues on that front as well (pending the AWS engineering team's review).
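For what it's worth, this is the kind of HDFS-to-S3 copy I had in mind as a workaround. It uses Hadoop's FileUtil.copy rather than distcp, and the bucket and directory names are placeholders (on EMR the S3 credentials are already configured on the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Sketch of the workaround: write job output to HDFS first, then copy it up to S3.
// Bucket and directory names are placeholders.
public class CopyOutputToS3 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Path hdfsOutput = new Path("hdfs:///output/myOutputDirFinal/");
        Path s3Output = new Path("s3://myS3Bucket/output/myOutputDirFinal/");

        FileSystem hdfs = hdfsOutput.getFileSystem(conf);
        FileSystem s3 = s3Output.getFileSystem(conf);

        // Recursive copy; deleteSource=false keeps the HDFS copy around so the upload can be retried.
        boolean ok = FileUtil.copy(hdfs, hdfsOutput, s3, s3Output, false, conf);
        System.out.println("Copy to S3 " + (ok ? "succeeded" : "failed"));
    }
}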

TL;DR: Using AWS EMR to store directly on S3 seems buggy; their engineering team is looking into it.

  • EMR clusters can write data natively to S3 and HDFS. HDFS on these clusters is made from the ephemeral storage of the nodes and is only available for the duration of the cluster. In order to make sure it's a problem with S3, perhaps you could try running the problematic query on the entire dataset but storing the results on HDFS? If you see the results in HDFS after the query, it would most likely mean a problem with S3 or its usage. Also, are you using the path as s3://... or s3n://... ? – Mark Grover Jun 23 '12 at 15:22
  • I was using `s3://` in my path. My full job has about 300 2GB files as input. When I run a sample job with 10 of these 2GB files, using the same output syntax, it works fine (stores to S3). I peeked around on HDFS before shutting down the cluster and didn't see any directories that seemed like they would contain the data (I killed the cluster though, so I can't double-check). Regarding rerunning the full job and having the output go to HDFS first, I could do that, but the cost is pretty high for me to fail another job. I'm hoping AWS staff reply to a duplicate of this I posted on their forums – Dolan Antenucci Jun 23 '12 at 15:45
  • @MarkGrover - I didn't realize there is a difference between s3:// and s3n://. Do you think using "s3://" could be the reason for my data not showing up? – Dolan Antenucci Jun 24 '12 at 17:32
  • Turns out on EMR, s3:// and s3n:// are the same thing. See my edit above. For now, I'm going to store the output of my job on HDFS, then use distcp to transfer it over to S3 – Dolan Antenucci Jun 24 '12 at 18:38
  • What availability zone on S3 are you using? And is there only a single reducer? I can happily confirm that the FileOutputCommitter used for S3 is fraught with problems, especially for the US Standard (bi-coastal) AZ. – Judge Mental Jun 24 '12 at 19:52
  • I'm using us-east-1. There are 30 reducers (3 each on 10 cc1.4xlarge servers). The output of each should be about 80GB. – Dolan Antenucci Jun 24 '12 at 20:11
  • I finally got a resolution on this from Amazon -- it turned out to be a bug on their end. See my answer for details. Thanks all for the help – Dolan Antenucci Sep 12 '12 at 00:54

1 Answer


This turned out to be a bug on AWS's part, and they've fixed it in the latest AMI version 2.2.1, briefly described in these release notes.

The long explanation I got from AWS is that when the reducer output files exceed the block limit for S3 (i.e. 5GB?), multipart upload is used, but proper error-checking was not in place, which is why it would sometimes work and other times not.
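Until you're on a fixed AMI, a quick sanity check after each job is to list the output directory on S3 and compare the part files against your reducer count. A rough sketch (bucket and path below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list the S3 output directory and print each part file's size,
// so missing or truncated reducer outputs stand out. Paths are placeholders.
public class VerifyS3Output {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("s3://myS3Bucket/output/myOutputDirFinal/");
        FileSystem fs = output.getFileSystem(conf);

        FileStatus[] parts = fs.globStatus(new Path(output, "part-*"));
        System.out.println("Found " + parts.length + " part files (expect one per reducer)");
        for (FileStatus part : parts) {
            System.out.println(part.getPath().getName() + "\t" + part.getLen() + " bytes");
        }
    }
}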

In case this continues for anyone else, refer to my case number, 62849531.
