
I have added the below JARs to the spark/jars path.

  • hadoop-aws-2.7.3.jar
  • aws-java-sdk-s3-1.11.126.jar
  • aws-java-sdk-core-1.11.126.jar
  • spark-2.1.0

In spark-shell

scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "***")

scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "***")

scala> val f = sc.textFile("s3a://bucket/README.md")

scala> f.count

java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
  ... 48 elided

  1. "java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager" is raised by mismatched jar? (hadoop-aws, aws-java-sdk)

  2. To access data stored in Amazon S3 from Spark applications, one should use the Hadoop file APIs. Does hadoop-aws.jar contain those Hadoop file APIs, or does a full Hadoop environment need to be running?

Nelson

1 Answer


Mismatched JARs; the AWS SDK is pretty brittle across versions.

The Hadoop S3A code is in the hadoop-aws JAR; it also needs hadoop-common. Hadoop 2.7 is built against AWS S3 SDK 1.10.6. (*Updated: no, it's 1.7.4; the move to 1.10.6 went into Hadoop 2.8. See HADOOP-12269.)
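
A quick way to confirm a mismatch from inside spark-shell (just a diagnostic sketch using plain JVM reflection, nothing Hadoop-specific) is to ask which JAR each class was actually loaded from:

scala> classOf[org.apache.hadoop.fs.s3a.S3AFileSystem].getProtectionDomain.getCodeSource.getLocation

scala> classOf[com.amazonaws.services.s3.transfer.TransferManager].getProtectionDomain.getCodeSource.getLocation

If the first points at a hadoop-aws-2.7.x JAR and the second at an aws-java-sdk-s3-1.11.x JAR, that is exactly the version mismatch described above.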

You must use that version. If you want to use the 1.11 JARs, then you will need to check out the Hadoop source tree and build branch-2 yourself. The good news: that uses the shaded AWS SDK, so its versions of Jackson and Joda-Time don't break things. Oh, and if you check out Spark master and build with the -Phadoop-cloud profile, it pulls the right stuff in to set Spark's dependencies up correctly.
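
For Spark 2.1.0 on Hadoop 2.7.x, a minimal sketch of one consistent setup (assuming the 1.11 SDK JARs are first removed from spark/jars; org.apache.hadoop:hadoop-aws:2.7.3 pulls in the monolithic com.amazonaws:aws-java-sdk 1.7.4 it was compiled against as a transitive dependency):

$ spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3

scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "***")

scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "***")

scala> sc.textFile("s3a://bucket/README.md").count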

Update: Oct 1 2017: Hadoop 2.9.0-alpha and 3.0-beta-1 use 1.11.199; assume the shipping versions will be that or more recent.

stevel
    "Hadoop S3A code is in hadoop-aws JAR; also needs hadoop-common. Hadoop 2.7 is built against AWS S3 SDK 1.10.6." How can an end-user figure out details like this for themselves? I'm getting as similar error as the OP with `hadoop-aws:2.7.4`, and I'm not sure exactly what set of things and versions I need to pass to `spark-submit` to get `s3a://` working as expected. I've tried several versions of `aws-java-sdk` without success. – Nick Chammas Oct 03 '17 at 20:50
  • I tried looking [here](https://github.com/apache/hadoop/blob/release-2.7.4-RC0/hadoop-tools/hadoop-aws/pom.xml#L120-L124) for clues, but there is no version specification. I'm not familiar with Java development, so perhaps I'm missing where to look. – Nick Chammas Oct 03 '17 at 20:50
  • oh, you are so close. In a large maven project you define the versions of things (along with cruft you don't want) in one file, and then reference them. Here you go https://github.com/apache/hadoop/blob/release-2.7.4-RC0/hadoop-project/pom.xml#L657 - 1.7.4 – stevel Oct 03 '17 at 21:04
  • Ah, OK! Getting closer, but I guess there is still a missing piece. I'm passing `hadoop-aws:2.7.4` to `spark-submit --packages`. With it, I can read a single-part ORC dataset on S3 but not a multi-part Parquet dataset. The multi-part dataset gives the `NoSuchMethodError: TransferManager` error. If I add `aws-java-sdk:1.7.4` to my list of `--packages`, it doesn't seem to help. I get the same error on the multi-part dataset. And looking through the Hadoop repo, I don't see any other AWS-related dependencies mentioned in that POM file. – Nick Chammas Oct 03 '17 at 21:25
  • I think my issues are related to EMR/YARN. If I drop `--master yarn` from my `spark-submit` invocation, I can do everything I need to do with just `--packages hadoop-aws:2.7.4`. Guess I should take this to the EMR forums. Currently working with `s3a://`, an EMR cluster, and a remote EC2 Spark client. – Nick Chammas Oct 03 '17 at 21:26