spark - redshift - s3: class path conflict

Question

I am trying to connect to redshift from spark 2.1.0 standalone cluster on AWS with hadoop 2.7.2 and allxuio, which gives me this error: Exception in thread "main" java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager <init(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)

From what I understand issue is in the:

Note on Amazon SDK dependency: This library declares a provided dependency on components of the AWS Java SDK. In most cases, these libraries will be provided by your deployment environment. However, if you get ClassNotFoundExceptions for Amazon SDK classes then you will need to add explicit dependencies on com.amazonaws.aws-java-sdk-core and com.amazonaws.aws-java-sdk-s3 as part of your build / runtime configuration. See the comments in project/SparkRedshiftBuild.scala for more details.

as described in spark-redshift-databricks, I have tried all possible combination of class-path jars with the same error. My spark submit where I placed all jars is as below:

/usr/local/spark/bin/spark-submit --class com.XX.XX.app.Test --driver-memory 2G --total-executor-cores 40 --verbose --jars /home/ubuntu/aws-java-sdk-s3-1.11.79.jar,/home/ubuntu/aws-java-sdk-core-1.11.79.jar,/home/ubuntu/postgresql-9.4.1207.jar,/home/ubuntu/alluxio-1.3.0-spark-client-jar-with-dependencies.jar,/usr/local/alluxio/core/client/target/alluxio-core-client-1.3.0-jar-with-dependencies.jar --master spark://XXX.eu-west-1.compute.internal:7077 --executor-memory 4G /home/ubuntu/QAe.jar qa XXX.eu-west-1.compute.amazonaws.com 100 --num-executors 10 --conf spark.executor.extraClassPath=/home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar --driver-class-path /home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar:/home/ubuntu/postgresql-9.4.1207.jar --driver-library-path /home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar --driver-library-path com.amazonaws.aws-java-sdk-s3:com.amazonaws.aws-java-sdk-core.jar --packages databricks:spark-redshift_2.11:3.0.0-preview1,com.amazonaws:aws-java-sdk-s3:1.11.79,com.amazonaws:aws-java-sdk-core:1.11.79

My built.sbt:

libraryDependencies += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.4" 
libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.79"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.79"
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.8.1"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-redshift" % "1.11.78"
libraryDependencies += "com.databricks" % "spark-redshift_2.11" % "3.0.0-preview1"
libraryDependencies += "org.alluxio" % "alluxio-core-client" % "1.3.0"
libraryDependencies += "com.taxis99" %% "awsscala" % "0.7.3"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies +=  "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies +=  "org.apache.spark" %% "spark-mllib" % sparkVersion

code is simply read from postgresql and write to redshift:

   val df = spark.read.jdbc(url_read,"public.test", prop).as[Schema.Message.Raw]
  .filter("message != ''")
  .filter("from_id >= 0")
  .limit(100)


df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://test.XXX.redshift.amazonaws.com:5439/test?user=test&password=testXXXXX")
  .option("dbtable", "table_test")
  .option("tempdir", "s3a://redshift_logs/")
  .option("forward_spark_s3_credentials", "true")
  .option("tempformat", "CSV")
  .option("jdbcdriver", "com.amazon.redshift.jdbc42.Driver")
  .mode(SaveMode.Overwrite)
  .save()

I have all listed jar files on all cluster nodes under /home/ubuntu/ as well.

Does anyone understand how to add explicit dependencies on com.amazonaws.aws-java-sdk-core and com.amazonaws.aws-java-sdk-s3 as part of build / runtime configuration in spark? Or is the issue with jars themselves: did I use wrong version 1.11.80 or .. 79 etc ? Do I need to exclude these libraries from build.sbt? Will changing hadoop to 2.8 solve the issue ?

Here are useful links that I based the test on: Dependency Management with Sparkere , Add jars to a Spark Job - spark-submit

score 3 · Accepted Answer · answered Jan 20 '17 at 13:25

3

Amazon tends to change the API of their library fast enough that all releases of hadoop-aws.jar need to be in sync with the AWS SDK; for Hadoop 2.7.x that's v 1.7.4 of the SDK. As it stands, you probably aren't going to get redshift and s3a to coexist, though you may be able to continue to work with the older s3n URLs.

The update to a newer SDK will only surface in Hadoop > 2.8, when it moves up to 1.11.45. Why so delayed? Because that forces an update to Jackson, which ends up breaking everything else downstream.

Welcome to the world of transitive dependency JAR hell, and let us all hope that Java 9 will address that —though it will need someone (you?) to add all the relevant module declarations

answered Jan 20 '17 at 13:25

stevel

12,567
1
39
50

thx @Steve Loughran, it is jar hell indeed, i think i'd rather replace redshift with mogo or cassandra to avoid the hustle – elcomendante Jan 20 '17 at 14:21
1

there's a 3.0 beta of hadoop coming out which uses a shaded version of the AWS SDK (1.11.199); and a more-ready-to use Hadoop 2.9 alpha to follow. Because we've switched to the shaded ones, the AWS choice of jackson version doesn't cause so many problems, but its APIs are still a moving target. – stevel Oct 03 '17 at 19:04

spark - redshift - s3: class path conflict

1 Answers1