1

I'm setting up a Hadoop cluster on EC2 and I'm wondering how to handle the DFS. All my data is currently in S3, and all map/reduce applications use S3 file paths to access the data. I've been looking at how Amazon's EMR is set up, and it appears that a namenode and datanodes are set up for each jobflow. Now I'm wondering whether I really need to do it that way, or whether I could just use s3(n) as the DFS. If I do, are there any drawbacks?

Thanks!

xinit
  • Can you share your config? It's not working for me; it's showing `ls: Permission denied: s3n://vhdsamrat/user/root` – gwthm.in Sep 26 '17 at 17:25

4 Answers

5

In order to use S3 instead of HDFS, fs.default.name in core-site.xml needs to point to your bucket:

<property>
        <name>fs.default.name</name>
        <value>s3n://your-bucket-name</value>
</property>

It's recommended that you use S3N and NOT the plain block-based S3 implementation, because files written through S3N remain readable by any other application, and by yourself :)

Also, in the same core-site.xml file you need to specify the following properties (see the sketch after this list):

  • fs.s3n.awsAccessKeyId
  • fs.s3n.awsSecretAccessKey

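For example, a minimal core-site.xml sketch with both keys set (the values below are placeholders, not real credentials):

<property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>YOUR_AWS_ACCESS_KEY_ID</value> <!-- placeholder: substitute your own access key id -->
</property>
<property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>YOUR_AWS_SECRET_ACCESS_KEY</value> <!-- placeholder: substitute your own secret key -->
</property>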

1

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/core-default.xml

fs.default.name is deprecated, so fs.defaultFS is probably the better property name to use.
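As a sketch, the same bucket configuration from the accepted answer with the newer property name would look like this (the bucket name is a placeholder):

<property>
        <name>fs.defaultFS</name>
        <value>s3n://your-bucket-name</value> <!-- placeholder bucket name -->
</property>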

Zhen Zeng
  • Will this make the intermediate data get saved to S3 too? Is there a way to keep the intermediate data local? – Sal Mar 16 '18 at 20:56
1

Any intermediate data of your job goes to HDFS, so yes, you still need a namenode and datanodes.
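If the goal is to keep intermediate data local while still reading from and writing to S3, one possible setup (a sketch, not taken from this answer; the hostname, port, and key values are placeholders) is to leave HDFS as the default filesystem and reference s3n:// paths explicitly in job inputs and outputs:

<!-- core-site.xml sketch: HDFS stays the default filesystem, so job-internal data stays on the cluster -->
<property>
        <name>fs.default.name</name>
        <value>hdfs://your-namenode-host:9000</value> <!-- placeholder host and port -->
</property>
<!-- S3N credentials so jobs can still read/write s3n://your-bucket-name/... paths directly -->
<property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>YOUR_AWS_ACCESS_KEY_ID</value> <!-- placeholder -->
</property>
<property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>YOUR_AWS_SECRET_ACCESS_KEY</value> <!-- placeholder -->
</property>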

mat kelcey
  • I've been able to get everything running against s3. If you specify fs.default.name to be on s3 (or s3n), intermediate results go there. – xinit Jun 15 '11 at 07:07
  • Why would you do this? The latency of s3 is _much_ higher than HDFS, and the intermediate data is effectively disposable. – mat kelcey Jun 16 '11 at 19:54
  • Since I want automatic scaling (up and down) of the cluster. With HDFS, that's not going to work. And I use s3, since we have a lot of data and not enough storage on a local cluster. – xinit Jun 17 '11 at 18:32
  • How do you find the latency hit? I'm surprised to hear that if you have too much intermediate data for HDFS, it's even usable against S3. – mat kelcey Jul 01 '11 at 17:48
0

I was able to get the s3 integration working using

<property>
        <name>fs.default.name</name>
        <value>s3n://your-bucket-name</value>
</property> 

in core-site.xml, and I can list the files using the hdfs dfs -ls command. But you should also have namenode and separate datanode configurations, because I was still not sure how the data gets partitioned across the data nodes.

Should we have local storage for the namenode and datanodes?