
When I connected my Hadoop cluster to Amazon S3 storage and downloaded files to HDFS, I found that s3:// did not work. While looking for help on the Internet, I found that I could use s3n://, and when I did, it worked. I do not understand the difference between using s3 and s3n with my Hadoop cluster. Can someone explain?

belka

3 Answers


The two filesystems for using Amazon S3 are documented on the Hadoop wiki page addressing Amazon S3:

  • S3 Native FileSystem (URI scheme: s3n)
    A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).

  • S3 Block FileSystem (URI scheme: s3)
    A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

There are two ways that S3 can be used with Hadoop's Map/Reduce, either as a replacement for HDFS using the S3 block filesystem (i.e. using it as a reliable distributed filesystem with support for very large files) or as a convenient repository for data input to and output from MapReduce, using either S3 filesystem. In the second case HDFS is still used for the Map/Reduce phase. [...]

[emphasis mine]

So the difference is mainly related to how the 5GB limit is handled (this is the largest object that can be uploaded in a single PUT, even though objects can range in size from 1 byte to 5 terabytes; see How much data can I store?): the S3 Block FileSystem (URI scheme: s3) lets you work around the 5GB limit and store files up to 5TB, but it replaces HDFS in turn.
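To make the distinction concrete, here is a minimal sketch of how the two schemes are addressed from Hadoop's Java FileSystem API. It is not part of the quoted documentation; the credential property names are the classic ones from the Hadoop S3 wiki page, and the bucket names are placeholders, so adjust both for your setup:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3SchemesSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Credentials for the S3 *native* filesystem (s3n://); property names
            // follow the classic Hadoop wiki conventions -- verify for your version.
            conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
            conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

            // The S3 *block* filesystem (s3://) uses its own, separate properties.
            conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");
            conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");

            // s3n:// sees ordinary S3 objects, so a file uploaded with any S3 tool
            // is visible here ("my-bucket" is a placeholder).
            FileSystem nativeFs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
            Path regularObject = new Path("s3n://my-bucket/input/data.txt");
            System.out.println("visible via s3n: " + nativeFs.exists(regularObject));

            // s3:// expects a bucket dedicated to Hadoop's block layout; its objects
            // show up in the AWS console as block_..., unusable by other S3 tools.
            FileSystem blockFs = FileSystem.get(URI.create("s3://my-block-bucket/"), conf);
            Path blockFile = new Path("s3://my-block-bucket/large/output.seq");
            System.out.println("visible via s3: " + blockFs.exists(blockFile));
        }
    }

The practical consequence is the dedicated bucket: the s3:// block filesystem expects a bucket it fully owns, while s3n:// works against any existing bucket.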

Steffen Opel
  • My example files are about 60MB, and in that case I could use either s3 or s3n, but only s3n worked. If the only difference were the 5GB file size limit, then both s3 and s3n should have worked, but they did not. –  May 14 '12 at 01:13
  • S3 supports up to 5 terabytes per object; it just needs to be uploaded in multiple parts. See: http://aws.amazon.com/s3/faqs/#How_much_data_can_I_store – Laurence Rowe Oct 29 '12 at 22:40
  • @LaurenceRowe: That's actually implied in the quotation, sort of (_can be larger than 5GB_), but thanks for pointing out the potentially confusing phrasing thereafter - I've tried to incorporate your comment to clarify this. – Steffen Opel Oct 30 '12 at 13:40
  • I have a question, Steffen. I usually create a Hive external table with its location on S3 and it works perfectly. The file is BSON, read via the mongo-hadoop connector. But most of the time my BSON files are larger than 5 GB, sometimes around 18 GB. How can I create an external table over a file of that size? My file is already in the bucket, and I don't mind if only Hadoop can use it, but the docs say that if you choose the S3 block filesystem you should not use an existing bucket containing files. How can I create external tables from files larger than 5 GB on S3? Thanks, Steffen. – Maziyar Nov 23 '13 at 00:48
  • The 5GB limit [was lifted in 2010](https://aws.amazon.com/blogs/aws/amazon-s3-object-size-limit/) – Sergey Orshanskiy Jun 03 '16 at 16:52

I think your main problem was related to having s3 and s3n as two separate connection points for Hadoop. s3n:// means "a regular file, readable from the outside world, at this S3 URL". s3:// refers to an HDFS-style block filesystem mapped onto an S3 bucket sitting in AWS storage. So to read a regular file from an Amazon S3 bucket you have to use s3n://, which is why switching to it resolved your problem. The information added by @Steffen is also great!
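As a quick way to see this in practice, the following is my own sketch (not from the answer above): it writes a small file through s3n:// and lists the bucket through the same handle. The object it creates is exactly what you would see in the AWS console, whereas a bucket driven by the s3:// block filesystem only ever shows opaque block_* objects. Bucket and key names are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3nInteropCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder
            conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder

            FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);

            // Write a regular S3 object; any other tool (console, CLI) sees it as-is.
            try (FSDataOutputStream out = fs.create(new Path("s3n://my-bucket/hello.txt"))) {
                out.writeBytes("hello from hadoop\n");
            }

            // List the bucket through the same s3n:// view.
            for (FileStatus status : fs.listStatus(new Path("s3n://my-bucket/"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }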

AvkashChauhan
  • Now I understand why the problem occurred. Thank you. –  May 14 '12 at 01:18
  • I believe that from within AWS EMR, both s3: and s3n: schemes are the same. Hadoop 2.x+ recommends using s3a: anyway. – DavidJ Aug 12 '16 at 19:20
  • For anyone stumbling across this now, the [aws docs](http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html) now recommend the s3:// prefix over s3n:// – Papples Sep 12 '16 at 14:55

Here is an explanation: https://notes.mindprince.in/2014/08/01/difference-between-s3-block-and-s3-native-filesystem-on-hadoop.html

The first S3-backed Hadoop filesystem was introduced in Hadoop 0.10.0 (HADOOP-574). It was called the S3 block filesystem and it was assigned the URI scheme s3://. In this implementation, files are stored as blocks, just like they are in HDFS. The files stored by this filesystem are not interoperable with other S3 tools - what this means is that if you go to the AWS console and try to look for files written by this filesystem, you won't find them - instead you would find files named something like block_-1212312341234512345 etc.

To overcome these limitations, another S3-backed filesystem was introduced in Hadoop 0.18.0 (HADOOP-930). It was called the S3 native filesystem and it was assigned the URI scheme s3n://. This filesystem lets you access files on S3 that were written with other tools... When this filesystem was introduced, S3 had a filesize limit of 5GB and hence this filesystem could only operate with files less than 5GB. In late 2010, Amazon... raised the file size limit from 5GB to 5TB...

Using the S3 block file system is no longer recommended. Various Hadoop-as-a-service providers like Qubole and Amazon EMR go as far as mapping both the s3:// and the s3n:// URIs to the S3 native filesystem to ensure this.

So always use the native filesystem. There is no longer a 5GB limit. Sometimes you may have to type s3:// instead of s3n://, but just make sure that any files you create are visible in the bucket explorer in the browser.
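If you want the s3:// prefix to behave like the native filesystem on a plain (non-EMR, non-Qubole) Hadoop installation, one way is to remap the scheme's implementation class, which is essentially what those providers do. This is only a sketch: the fs.s3.impl property and the NativeS3FileSystem class name are the classic Hadoop 1.x/2.x values, so check your distribution before relying on them.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class RemapS3ToNative {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Back the s3:// scheme with the *native* filesystem so that s3:// paths
            // behave like s3n:// (plain objects, no block_* layout). Class name as
            // shipped with classic Hadoop; verify it exists in your version.
            conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
            conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder
            conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder

            FileSystem fs = FileSystem.get(URI.create("s3://my-bucket/"), conf);
            System.out.println("s3:// is now backed by: " + fs.getClass().getName());
        }
    }

The same property can of course be set once in core-site.xml instead of in code.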

Also see http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html.

Previously, Amazon EMR used the S3 Native FileSystem with the URI scheme, s3n. While this still works, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.

It also says you can use s3bfs:// to access the old block file system, previously known as s3://.

Sergey Orshanskiy