
I'm trying to load data from an Amazon S3 bucket while in the Spark shell.

I have consulted the following resources:

Parsing files from Amazon S3 with Apache Spark

How to access s3a:// files from Apache Spark?

Hortonworks Spark 1.6 and S3

Cloudera

Custom s3 endpoints

I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note that I have replaced access-key and secret-key with my actual values):

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key 
spark.hadoop.fs.s3a.secret.key=secret-key

I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:

bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar

In the shell, here is how I try to load data from the S3 bucket:

val p = spark.read.textFile("s3a://sparkcookbook/person")

And here is the error that results:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)

When I instead try to start the Spark shell as follows:

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1

Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:

:: problems summary ::
:::: ERRORS
    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

And here is the second:

val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)

Could someone suggest how to get this working? Thanks.

Shafique Jamal

1 Answer


If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar.

$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar

After that, you will be able to load data from the S3 bucket in the shell.
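
(For reference, a minimal sketch of what the read looks like once the shell is started with the matching jars. The bucket path is the one from the question, the keys are placeholders, and setting the credentials on the Hadoop configuration at runtime is simply an alternative to spark-defaults.conf or the AWS_* environment variables mentioned in the comments below.)

// Inside spark-shell launched with hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar.
// Runtime alternative to spark-defaults.conf / AWS_* environment variables:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "access-key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secret-key")

// Read the text files under the s3a:// path into a Dataset[String].
val p = spark.read.textFile("s3a://sparkcookbook/person")
p.show(5, truncate = false)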

himanshuIIITian
  • Thanks - this did it. Also, in case anyone else is trying this approach, before starting the Spark shell I had to set two environment variables as follows: `export AWS_ACCESS_KEY_ID="access-key"` and `export AWS_SECRET_ACCESS_KEY="secret-key"`. – Shafique Jamal Aug 19 '17 at 20:59
  • Thanks for the info. Would you please tell me how you found the versions of `hadoop-aws` and `aws-java-sdk` that are compatible with Spark 2.2? – FloWi Jan 12 '18 at 19:03
  • It is easy! By default, Spark 2.2 is pre-built for Hadoop 2.7.x, so we have to use `hadoop-aws` v2.7.x, which has a compile dependency on `aws-java-sdk` v1.7.4. If we build Spark 2.2 with Hadoop 2.8.x, then we have to use `hadoop-aws` v2.8.x & `aws-java-sdk` v1.10.6. For further info you can refer to their Maven repos. – himanshuIIITian Jan 14 '18 at 05:39
  • I'm also confused about these dependency versions. I'm looking at the Spark 2.2.1 Maven page right now and the Hadoop version is 2.6.5? https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11/2.2.1 – lfk Feb 17 '18 at 05:07
  • @lfk yes...you are right!!! Spark 2.2.1 has default Hadoop v2.6.5. But, in my comment I said that - "By default, Spark 2.2 is pre-built for Hadoop 2.7.x", i.e., the pre-built version of Spark 2.2 comes with Hadoop 2.7.x (as mentioned here - http://spark.apache.org/downloads.html). So, it all depends on the Hadoop version used for building Spark. – himanshuIIITian Feb 18 '18 at 06:13
  • Okay. So my Spark was installed through `pip install pyspark` and all I know is it's 2.2.1. How do I find out which Hadoop it was built with? Can't even figure out where it's installed :/ – lfk Feb 20 '18 at 01:41 (see the version-check sketch after these comments)
  • @lfk I don't have much experience with `pyspark`. Maybe you can post a question on SO for it and someone with more experience will be able to answer it. – himanshuIIITian Feb 20 '18 at 04:31
  • Found that out. It's built with 2.7.3. Does that mean I'm limited to this version of Hadoop or is it still possible to somehow use a newer version when calling spark-submit? – lfk Feb 21 '18 at 06:03
  • @lfk No, we are not bound to use Hadoop v2.7.3. We can use any version we want. But for that, we have to build Spark's source code using following command - `./build/mvn -Phadoop-2.8,yarn,mesos,hive,hive-thriftserver -DskipTests clean install` – himanshuIIITian Feb 21 '18 at 12:13
  • @himanshuIIITian Are you sure I can build with 2.8? I'm getting `The requested profile "hadoop-2.8" could not be activated because it does not exist.` – lfk Feb 26 '18 at 04:28
  • @lfk My bad...Spark cannot be built with Hadoop v2.8 as the profile does not exist. It seems like we are stuck with Hadoop v2.7 for now - https://github.com/apache/spark/blob/master/pom.xml#L2657 – himanshuIIITian Feb 26 '18 at 04:40
  • @himanshuIIITian Solved. I installed Hadoop 2.8 using brew (had to modify the formula for 2.8.2 as I wanted 2.8.3, which is what the latest Amazon EMR uses). I then manually installed Spark 2.2.1 from the Hadoop-free binaries (on the download page) and used these instructions to point it to my Hadoop installation: https://spark.apache.org/docs/2.1.0/hadoop-provided.html – lfk Feb 26 '18 at 22:13
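
(Regarding the question in the comments above about which Hadoop version a given Spark build bundles: a quick check of my own, not from the thread, is to query Hadoop's VersionInfo from the Spark shell, since hadoop-common is already on the classpath.)

// Run inside spark-shell.
println(org.apache.hadoop.util.VersionInfo.getVersion) // Hadoop version bundled with this Spark build
println(spark.version)                                  // Spark version, for comparison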