
I have a Spark application that fetches data from HDFS and ingests it into S3. Below are the versions of the components I am using.

spark  : 2.3.1
hadoop : 2.7.3
scala  : 2.11.8

I am using hadoop-aws-2.7.3.jar, hadoop-common-2.7.3.jar and aws-java-sdk-1.7.4.jar. I followed several blog posts related to Hadoop and also checked the mvnrepository site to find the right combination of jars.

This is the code where I upload the file to S3:

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<access_key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<secret_key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "<access_endpoint>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
val wikipediaDataitems = spark.read.json("<some_json_file_in_hdfs>")
wikipediaDataitems.write.format("json").save("s3a://<bucket_name>/wikipedia.json")
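As an aside, a hedged sketch of the same configuration that avoids hardcoding credentials (the environment-variable names are the standard AWS ones, but reading them this way is my assumption about how the job is launched, not part of the original code):

```scala
// Sketch: same S3A setup, but credentials pulled from the environment
// instead of being hardcoded in source. Assumes AWS_ACCESS_KEY_ID and
// AWS_SECRET_ACCESS_KEY are exported in the shell that starts the driver.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
hc.set("fs.s3a.endpoint", "<access_endpoint>")
hc.set("fs.s3a.path.style.access", "true")
// fs.s3a.impl is deliberately omitted: Hadoop 2.7+ already maps the
// s3a:// scheme to org.apache.hadoop.fs.s3a.S3AFileSystem in
// core-default.xml, so setting it explicitly is redundant.
```

This is a configuration fragment only; it changes nothing about the jar-version problem below.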

Below is the error I am getting:

Caused by: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:163)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:185)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:112)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:146)

I went through a lot of Stack Overflow questions from people who have faced the same issue and tried different combinations of the hadoop-aws, hadoop-common and aws-java-sdk jars, with no luck so far.

Combinations tried so far, with the relevant error for each:

hadoop-aws-2.7.3.jar, hadoop-common-2.7.3.jar, aws-java-sdk-1.10.6.jar

org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:452)
  org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:548)
  org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:278)
  org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
  org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
 ... 49 elided
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

hadoop-aws-2.8.2.jar, hadoop-common-2.8.2.jar, aws-java-sdk-1.10.6.jar

java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)

hadoop-aws-2.7.3.jar, hadoop-common-2.7.3.jar, aws-java-sdk-1.11.123.jar

Caused by: java.lang.ClassNotFoundException: com.amazonaws.event.ProgressListener
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 66 more

hadoop-aws-2.7.7.jar, hadoop-common-2.7.7.jar, aws-java-sdk-1.7.4.jar

Caused by: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:163)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:185)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:112)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:146)

Can anyone help me?

  • could we peek at your pom (or other)? This seems like a binary incompatibility issue. – Saša Zejnilović Sep 12 '20 at 20:57
  • This is my pom: org.apache.hadoop:hadoop-aws:2.7.3, org.apache.hadoop:hadoop-common:2.7.3, com.amazonaws:aws-java-sdk:1.7.4. I have tried different combinations building the jar and also tried using spark-shell --jars. – mrutunjay chavadi Sep 14 '20 at 05:19
  • Have a look at mvnrepository to see what is needed. Your first error message means "hadoop-common and hadoop-aws" are out of sync; the second means "hadoop-aws and the AWS SDK" are out of sync: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws – stevel Sep 17 '20 at 17:24
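stevel's advice can be made concrete. As a hedged illustration, here is a tiny Scala helper encoding which AWS SDK artifact each hadoop-aws line was compiled against; the table is my reading of mvnrepository's dependency listings for those releases, not an authoritative mapping, so verify it against the exact hadoop-aws version you ship:

```scala
// Illustrative lookup: hadoop-aws release line -> the AWS SDK artifact
// it declares as a dependency (per mvnrepository; treat as a starting
// point and double-check for your exact version).
object S3ADependencyCheck {
  val sdkFor: Map[String, String] = Map(
    "2.7" -> "com.amazonaws:aws-java-sdk:1.7.4",
    "2.8" -> "com.amazonaws:aws-java-sdk-s3:1.10.6",
    "2.9" -> "com.amazonaws:aws-java-sdk-bundle:1.11.199"
  )

  // Reduce a full version like "2.7.3" to its "major.minor" line,
  // then look up the matching SDK coordinate.
  def matchingSdk(hadoopVersion: String): Option[String] =
    sdkFor.get(hadoopVersion.split('.').take(2).mkString("."))
}
```

The point of the exercise: hadoop-aws, hadoop-common and the AWS SDK must all come from one consistent row of that table; mixing rows produces exactly the IllegalAccessError and ClassNotFoundException errors shown above.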

1 Answer


I faced the same issue when trying this on my local machine. It is caused by conflicts and incompatibilities between the versions of the various dependencies specified in your pom.

The following should work for Spark 2.4.5 with Scala 2.11:

org.apache.hadoop:hadoop-aws:2.8.5
com.amazonaws:aws-java-sdk:1.11.659
org.apache.hadoop:hadoop-common:2.8.5
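If you build with sbt rather than Maven, the same coordinates could be declared like this (a sketch only; the Spark version and the "provided" scoping are assumptions about how the application is packaged and deployed):

```scala
// build.sbt sketch using the coordinates from this answer.
// "provided" assumes Spark is supplied by the cluster at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"     % "2.4.5" % "provided",
  "org.apache.hadoop" %  "hadoop-common" % "2.8.5",
  "org.apache.hadoop" %  "hadoop-aws"    % "2.8.5",
  "com.amazonaws"     %  "aws-java-sdk"  % "1.11.659"
)
```

Whichever build tool you use, the essential point is that all three artifacts are pinned together rather than resolved independently.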

If you want to test locally with spark-shell, I suggest using the Hadoop Free build: https://spark.apache.org/docs/2.4.0/hadoop-provided.html

Refer to the Stack Overflow thread below for instructions on setting up a local environment to test with spark-shell:

Hadoop 2.9.2, Spark 2.4.0 access AWS s3a bucket

voidone