I'm trying to integrate Spark 2.3.0, running on my Mac, with S3. I can read/write to S3 without any problem using spark-shell, but when I try to do the same from a small Scala program that I run via sbt, I get java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider.

I have installed hadoop-aws 3.0.0-beta1 and set the S3 access information in spark-2.3.0/conf/spark-defaults.conf:

spark.hadoop.fs.s3a.impl                              org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key                        XXXX
spark.hadoop.fs.s3a.secret.key                        YYYY
spark.hadoop.com.amazonaws.services.s3.enableV4       true
spark.hadoop.fs.s3a.endpoint                          s3.us-east-2.amazonaws.com
spark.hadoop.fs.s3a.fast.upload                       true
spark.hadoop.fs.s3a.encryption.enabled                true
spark.hadoop.fs.s3a.server-side-encryption-algorithm  AES256
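
The same S3A options can also be set programmatically when the SparkSession is built; the following is only a minimal sketch using the keys above, with placeholder credentials:

import org.apache.spark.sql.SparkSession

// Sketch: pass the spark.hadoop.* keys straight to the builder instead of
// relying on spark-defaults.conf (placeholder credentials, same keys as above).
val spark = SparkSession.builder()
  .master("local")
  .appName("Spark AWS S3 example")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.access.key", "XXXX")
  .config("spark.hadoop.fs.s3a.secret.key", "YYYY")
  .config("spark.hadoop.fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
  .getOrCreate()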

The program compiles fine using sbt version 0.13.

name := "S3Test"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
libraryDependencies +=  "org.apache.hadoop" % "hadoop-aws" % "3.0.0-beta1"

The Scala code is:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import com.amazonaws._
import com.amazonaws.auth._
import com.amazonaws.services.s3._
import com.amazonaws.services.s3.model._
import java.io._
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.s3a.S3AFileSystem

object S3Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("Spark AWS S3 example").getOrCreate()
    import spark.implicits._
    // Read a local text file, then write it out to the S3 bucket.
    val df = spark.read.text("test.txt")
    df.take(5)
    df.write.save(<s3 bucket>)
  }
}

I have set environment variables for JAVA_HOME, HADOOP_HOME, SPARK_HOME, CLASSPATH, SPARK_DIST_CLASSPATH, etc. But nothing lets me get past this error message.

Dean Sha

1 Answer

You can't mix hadoop-* JARs; they all need to be in perfect sync. That means cutting all the Hadoop 2.7 artifacts and replacing them.

FWIW, there isn't a significant enough difference between Hadoop 2.8 and Hadoop 3.0-beta-1 in terms of AWS support, other than the S3Guard DDB directory service (performance and consistent listing through DynamoDB), so unless you need that feature, Hadoop 2.8 is going to be adequate.
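
A minimal build.sbt sketch along those lines, with every hadoop-* artifact pinned to one version (the 2.8.1 / Spark 2.2.0 pairing here is an assumption; match it to the Hadoop build your Spark distribution actually ships with):

name := "S3Test"

scalaVersion := "2.11.8"

// Sketch: one Hadoop version for every hadoop-* artifact, so nothing from
// the Hadoop 2.7 line is mixed in. The versions below are illustrative.
val hadoopVersion = "2.8.1"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "2.2.0",
  "org.apache.spark"  %% "spark-sql"     % "2.2.0",
  "org.apache.hadoop" %  "hadoop-client" % hadoopVersion,
  "org.apache.hadoop" %  "hadoop-aws"    % hadoopVersion
)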

stevel
  • Thanks Steve. This definitely helped. I got rid of all the AWS-related jars and commented out all the AWS-related import statements in the code. I downloaded aws-java-sdk-1.11.179.jar and hadoop-aws-2.8.1.jar and put them in the CLASSPATH. Now spark-submit works fine; earlier, it only worked from spark-shell. However, my sbt run still doesn't work. I changed my build.sbt to get rid of the hadoop-aws dependency and instead added these 2 jars as dependencies. Now sbt run is giving java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException. Perhaps sbt is not picking up the AWS settings from Spark. – Dean Sha Nov 07 '17 at 00:36
  • You are going to need to pull in the matching Amazon S3 SDK, I'm afraid. – stevel Nov 14 '17 at 11:46
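
For what it's worth, a hedged sketch of what pulling in the matching SDK could look like in build.sbt; the exact SDK artifact and version should be read off the hadoop-aws POM for the Hadoop release in use:

// Sketch: let sbt manage hadoop-aws so the AWS SDK it was compiled against
// (which supplies com.amazonaws.AmazonClientException via aws-java-sdk-core)
// is pulled in transitively, instead of dropping loose jars on the CLASSPATH.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.8.1"

// If the SDK has to be pinned explicitly, match the version declared in the
// hadoop-aws 2.8.1 POM (believed to be aws-java-sdk-s3 1.10.6; verify there).
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.10.6"

Alternatively, running the job with spark-submit --packages org.apache.hadoop:hadoop-aws:2.8.1 lets Spark resolve the artifact and its transitive SDK at submit time.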