
I have a text file in S3 that I would like to load into an RDD with spark-shell.

I have downloaded Spark 2.3.0 pre-built for Hadoop 2.7. Naively, I would expect that I just need to set the Hadoop settings and I'd be set.

val inFile = "s3a://some/path"
val accessKey = "some-access-key"
val secretKey = "some-secret-key"

sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)

def run() = scala.util.Try {
  sc.textFile(inFile).count()
}

println(run())

Invoking the final line returns:

Failure(java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found)

This seems to be asking that I provide the library which includes S3AFileSystem. No problem - I download the appropriate jar and add this line to the beginning of the script.

:require C:\{path-to-jar}\hadoop-aws-3.1.0.jar

Now, running the script fails at the final line with a variety of errors similar to this:

error: error while loading Partition, class file 'C:\spark\spark-2.3.0-bin-hadoop2.7\jars\spark-core_2.11-2.3.0.jar(org/apache/spark/Partition.class)' has location not matching its contents: contains class Partition

I'm lost at this point - clearly, it had no issue defining the run method before.

I can access the Partition class myself directly, but something is happening above that prevents the code from accessing Partition.

scala> new org.apache.spark.Partition {def index = 3}
res6: org.apache.spark.Partition = $anon$1@3

Curiously, running the final line of the script yields a different error in subsequent invocations.

scala> sc.textFile(inFile).count()
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
  at java.lang.ClassLoader.defineClass1(Native Method)
  ...

The documentation claims this is part of hadoop 3.1.0, which I'm using, but when exploring hadoop-aws-3.1.0.jar I see no trace of StreamCapabilities.
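
(As a quick diagnostic, not part of the original script, one can ask the classloader from within spark-shell which JAR, if any, a given class was loaded from; whichJar below is just an illustrative helper.)

    import scala.util.Try

    // Ask the classloader where a class comes from; a missing class yields a Failure.
    def whichJar(name: String) =
      Try(Class.forName(name).getProtectionDomain.getCodeSource.getLocation)

    println(whichJar("org.apache.hadoop.fs.s3a.S3AFileSystem"))
    println(whichJar("org.apache.hadoop.fs.StreamCapabilities"))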

Is there a different jar I should be using? Am I trying to solve this problem incorrectly? Or, have I fallen into the XY problem trap?

Answers I tried

  • The official docs assume I'm running the script on a cluster. But I'm running spark-shell locally.
  • This other StackOverflow question is for an older problem. I'm using s3a as a result, but am encountering a different problem.
  • I also tried using every jar of Hadoop from 2.6 to 3.1, to no avail.
Will Beason

2 Answers


org.apache.hadoop.fs.StreamCapabilities is in hadoop-common-3.1.jar. You are probably mixing versions of the Hadoop JARs which, as covered in the s3a troubleshooting docs, is doomed.

The Spark shell works fine with the right JARs on the classpath. But ASF Spark releases don't work with Hadoop 3.x yet, due to some outstanding issues. Stick to Hadoop 2.8.x and you'll get good S3 performance without so much pain.
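
As a quick sanity check (a sketch, not from the original answer), you can print the Hadoop version your Spark build actually ships with, and then pin any extra hadoop-aws JAR to exactly that version:

    // Print the Hadoop version bundled with this Spark build; any hadoop-aws JAR
    // added on top should match this version exactly.
    println(org.apache.hadoop.util.VersionInfo.getVersion)

For the spark-2.3.0-bin-hadoop2.7 build in the question this prints a 2.7.x version, which is why adding hadoop-aws-3.1.0.jar on top produces the mixed-version errors above.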

stevel
  • Hello Steve, I've been using a mix of Spark 2.4 and Hadoop 3 without issue so far. Is there a known incompatibility? – Kiwy Nov 26 '19 at 15:28

I found a path that fixed the issue, but I have no idea why.

  1. Create an SBT IntelliJ project
  2. Include the below dependencies and overrides
  3. Run the script (sans require statement) from sbt console (see the sketch below for setting up the SparkContext there)

    scalaVersion := "2.11.12"
    
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
    libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.0"
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.1.0"
    
    dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7"
    dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
    dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.7"
    

The key part, naturally, is overriding the jackson dependencies.
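
One practical detail for step 3 (a sketch under the assumption of a local master; the placeholder values mirror the question): unlike spark-shell, sbt console does not pre-create sc, so the SparkContext has to be built by hand before the rest of the script runs.

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder values, as in the question.
    val inFile = "s3a://some/path"
    val accessKey = "some-access-key"
    val secretKey = "some-secret-key"

    // sbt console provides no `sc`, so build a local SparkContext explicitly.
    val conf = new SparkConf().setMaster("local[*]").setAppName("s3a-count")
    val sc = new SparkContext(conf)

    sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
    sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)

    println(sc.textFile(inFile).count())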

Will Beason