I have a text file in S3 that I would like to load into an RDD with spark-shell.
I have downloaded Spark 2.3.0, pre-built for Hadoop 2.7. Naively, I would expect that I just need to set the Hadoop configuration values and I'd be set.
val inFile = "s3a://some/path"
val accessKey = "some-access-key"
val secretKey = "some-secret-key"
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
def run() = scala.util.Try { sc.textFile(inFile).count() }
println(run())
Invoking the final line returns:
Failure(java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found)
This seems to be telling me that I need to provide the library containing S3AFileSystem. No problem - I download the appropriate jar and add this line to the beginning of the script:
:require C:\{path-to-jar}\hadoop-aws-3.1.0.jar
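(As an aside, I assume the equivalent would be to put the jar on the classpath when launching the shell instead, something like:
spark-shell --jars C:\{path-to-jar}\hadoop-aws-3.1.0.jar
but for this script I stuck with :require.)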
Now, running the script fails at the final line with a variety of errors similar to this:
error: error while loading Partition, class file 'C:\spark\spark-2.3.0-bin-hadoop2.7\jars\spark-core_2.11-2.3.0.jar(org/apache/spark/Partition.class)' has location not matching its contents: contains class Partition
I'm lost at this point - clearly, the shell had no issue defining the run method before. I can access the Partition class directly myself, but something happening above prevents the code from accessing it:
scala> new org.apache.spark.Partition {def index = 3}
res6: org.apache.spark.Partition = $anon$1@3
Curiously, running the final line of the script yields a different error on subsequent invocations:
scala> sc.textFile(inFile).count()
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
at java.lang.ClassLoader.defineClass1(Native Method)
...
The documentation claims this class is part of Hadoop 3.1.0, which I'm using, but when exploring hadoop-aws-3.1.0.jar I see no trace of StreamCapabilities.
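(For reference, I checked for the class by simply listing the jar's contents and filtering for the name, roughly:
jar tf hadoop-aws-3.1.0.jar | findstr StreamCapabilities
which returns nothing.)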
Is there a different jar I should be using? Am I trying to solve this problem incorrectly? Or, have I fallen into the XY problem trap?
Answers I tried
- The official docs assume I'm running the script on a cluster, but I'm running spark-shell locally (see the note after this list).
- This other StackOverflow question is for an older problem; I'm using s3a as a result, but am encountering a different problem.
- I also tried using every jar of Hadoop from 2.6 to 3.1, to no avail.
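(By "locally" I just mean launching the shell on my Windows machine with a local master, roughly:
spark-shell --master local[*]
and running the script from there.)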