
I've been through all the threads on the dependencies for connecting Spark running on AWS EMR to an S3 bucket, but my issue seems to be slightly different. In every other discussion I have seen, the s3 and s3a protocols share the same dependencies, so I'm not sure why one works for me while the other does not. Currently, running Spark in local mode, s3a does the job just fine, but my understanding is that s3 is what's supported when running on EMR (due to its reliance on HDFS block storage). What am I missing for the s3 protocol to work?

spark.read.format("csv").load("s3a://mybucket/testfile.csv").show()
//this works, displays the df

versus

spark.read.format("csv").load("s3://mybucket/testfile.csv").show()
/*
java.io.IOException: No FileSystem for scheme: s3
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:355)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  ... 51 elided
*/
S Dub

1 Answer


Apache Hadoop provides the following filesystem clients for reading from and writing to Amazon S3:

  1. S3 (URI scheme: s3) - the original Apache Hadoop implementation of a block-based filesystem backed by S3. It has been deprecated and removed from Apache Hadoop releases (as of Hadoop 3.x), so recent hadoop-aws JARs register no client for the s3 scheme.

  2. S3A (URI scheme: s3a) - S3A uses Amazon's libraries (the AWS SDK) to interact with S3. S3A supports accessing files larger than 5 GB (up to 5 TB), and it provides performance enhancements and other improvements.

  3. S3N (URI scheme: s3n) - a native filesystem for reading and writing regular files on S3. s3n supports objects up to 5 GB in size.

Reference:

Technically what is the difference between s3n, s3a and s3?

https://web.archive.org/web/20170718025436/https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/
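Since no s3:// client ships in the ASF hadoop-aws JAR, one workaround in local mode is to map the bare s3 scheme onto the S3A implementation via Hadoop configuration. A minimal sketch (not from the original answer; the app name and bucket path are placeholders, and valid AWS credentials are assumed to be available through the usual provider chain):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("s3-scheme-demo")
  // Tell Hadoop which FileSystem class backs the s3:// scheme.
  // The spark.hadoop.* prefix forwards the property to the Hadoop Configuration.
  .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getOrCreate()

// With the mapping in place, both URI forms resolve to the same S3A client:
spark.read.format("csv").load("s3://mybucket/testfile.csv").show()
spark.read.format("csv").load("s3a://mybucket/testfile.csv").show()
```

On an actual EMR cluster this mapping should not be needed, because EMR ships its own EMRFS client registered for the s3:// scheme.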

hagarwal
  • Right, the question isn't what the difference is; the question is why I am able to use the s3a protocol but not s3. It seems like the hadoop-aws JAR dependencies should be the same (I have tried versions 2.7.3, 2.9.2, and 3.2.1). Is it because I am running in Spark local mode, rather than on an EMR/Hadoop cluster? – S Dub Oct 15 '19 at 15:24
  • There is no s3:// client in the ASF hadoop-aws JAR. A stack trace is inevitable. – stevel Oct 24 '19 at 17:15