I am trying to connect to Redshift from a Spark 2.1.0 standalone cluster on AWS with Hadoop 2.7.2 and Alluxio, which gives me this error:
Exception in thread "main" java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
From what I understand, the issue is described in this note from the spark-redshift documentation:
Note on Amazon SDK dependency: This library declares a provided dependency on components of the AWS Java SDK. In most cases, these libraries will be provided by your deployment environment. However, if you get ClassNotFoundExceptions for Amazon SDK classes then you will need to add explicit dependencies on com.amazonaws.aws-java-sdk-core and com.amazonaws.aws-java-sdk-s3 as part of your build / runtime configuration. See the comments in project/SparkRedshiftBuild.scala for more details.
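To make sure I am reading that note correctly, this is how I interpreted "explicit dependencies" on the build side (an sbt sketch only; it is essentially what already sits in my build.sbt below, and the 1.11.79 version is my own choice, not one the docs prescribe):
// Sketch: explicit AWS SDK dependencies as I understood the note above.
// Version 1.11.79 is my own pick, not a documented requirement.
libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.79"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3"   % "1.11.79"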
As described in the Databricks spark-redshift docs, I have tried every combination of classpath jars I could think of, always with the same error. My spark-submit command, with all of the jars, is below:
/usr/local/spark/bin/spark-submit \
  --class com.XX.XX.app.Test \
  --driver-memory 2G \
  --total-executor-cores 40 \
  --verbose \
  --jars /home/ubuntu/aws-java-sdk-s3-1.11.79.jar,/home/ubuntu/aws-java-sdk-core-1.11.79.jar,/home/ubuntu/postgresql-9.4.1207.jar,/home/ubuntu/alluxio-1.3.0-spark-client-jar-with-dependencies.jar,/usr/local/alluxio/core/client/target/alluxio-core-client-1.3.0-jar-with-dependencies.jar \
  --master spark://XXX.eu-west-1.compute.internal:7077 \
  --executor-memory 4G \
  /home/ubuntu/QAe.jar qa XXX.eu-west-1.compute.amazonaws.com 100 \
  --num-executors 10 \
  --conf spark.executor.extraClassPath=/home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar \
  --driver-class-path /home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar:/home/ubuntu/postgresql-9.4.1207.jar \
  --driver-library-path /home/ubuntu/aws-java-sdk-s3-1.11.79.jar:/home/ubuntu/aws-java-sdk-core-1.11.79.jar \
  --driver-library-path com.amazonaws.aws-java-sdk-s3:com.amazonaws.aws-java-sdk-core.jar \
  --packages databricks:spark-redshift_2.11:3.0.0-preview1,com.amazonaws:aws-java-sdk-s3:1.11.79,com.amazonaws:aws-java-sdk-core:1.11.79
My build.sbt (sparkVersion is "2.1.0"):
libraryDependencies += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.4"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.79"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.79"
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.8.1"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-redshift" % "1.11.78"
libraryDependencies += "com.databricks" % "spark-redshift_2.11" % "3.0.0-preview1"
libraryDependencies += "org.alluxio" % "alluxio-core-client" % "1.3.0"
libraryDependencies += "com.taxis99" %% "awsscala" % "0.7.3"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-mllib" % sparkVersion
The code simply reads from PostgreSQL and writes to Redshift:
val df = spark.read.jdbc(url_read,"public.test", prop).as[Schema.Message.Raw]
.filter("message != ''")
.filter("from_id >= 0")
.limit(100)
df.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://test.XXX.redshift.amazonaws.com:5439/test?user=test&password=testXXXXX")
.option("dbtable", "table_test")
.option("tempdir", "s3a://redshift_logs/")
.option("forward_spark_s3_credentials", "true")
.option("tempformat", "CSV")
.option("jdbcdriver", "com.amazon.redshift.jdbc42.Driver")
.mode(SaveMode.Overwrite)
.save()
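For completeness, the only S3-related configuration I set in code is the s3a access and secret key on the Hadoop configuration before the write (a minimal sketch; the key names are the standard fs.s3a.* ones from the Hadoop docs, values redacted):
// Sketch: s3a credentials set on the Hadoop configuration of the session
// used above; spark-redshift should pick these up via forward_spark_s3_credentials.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "XXXXX")  // redacted
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "XXXXX")  // redacted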
I also have all of the listed jar files on every cluster node under /home/ubuntu/.
Does anyone know how to add explicit dependencies on com.amazonaws.aws-java-sdk-core and com.amazonaws.aws-java-sdk-s3 as part of the build / runtime configuration in Spark? Or is the issue with the jars themselves, i.e. did I pick the wrong version (1.11.80 vs. 1.11.79, etc.)? Do I need to exclude these libraries from build.sbt? Would moving to Hadoop 2.8 solve the issue?
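To be concrete about the "exclude" option above, this is the kind of change I mean (just a sketch of the sbt syntax; I have not verified that it actually helps):
// Sketch only: keep hadoop-aws but drop whatever AWS SDK it pulls in transitively,
// so that only the explicitly declared 1.11.79 artifacts end up on the classpath.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll ExclusionRule(organization = "com.amazonaws")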
Here are the links I based this test on: Dependency Management with Spark, Add jars to a Spark Job - spark-submit