
I am getting a java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders exception while submitting my Spark job using spark-submit:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;
    at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:740)
    at org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.<init>(SimpleAWSCredentialsProvider.java:58)
    at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:600)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:260)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)

Spark version installed: 2.4.5

My configuration (build.gradle):

buildscript {
    repositories {
        maven {
            url "https://*********/****/content/repositories/thirdparty"
            credentials {
                username ****User
                password ****Pwd
            }
        }
    }
}

plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '1.2.3'
}

group 'com.felix'
version '1.0-SNAPSHOT'

sourceCompatibility = 1.8

repositories {
    mavenCentral()
    mavenLocal()
}

dependencies {
    compile group: 'org.apache.spark', name: 'spark-hadoop-cloud_2.11', version: '2.4.2.3.1.3.0-79'
    // https://mvnrepository.com/artifact/org.apache.spark/spark-sql
    compileOnly group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.4.5'
    // https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind
    compileOnly group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.6.7.3'
    // https://mvnrepository.com/artifact/org.apache.parquet/parquet-column
    compileOnly group: 'org.apache.parquet', name: 'parquet-column', version: '1.10.1'
    // https://mvnrepository.com/artifact/org.apache.parquet/parquet-hadoop
    compileOnly group: 'org.apache.parquet', name: 'parquet-hadoop', version: '1.10.1'
    // https://mvnrepository.com/artifact/org.apache.spark/spark-sketch
    compileOnly group: 'org.apache.spark', name: 'spark-sketch_2.11', version: '2.4.5'
    // https://mvnrepository.com/artifact/org.apache.spark/spark-core
    compileOnly group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.4.5'
    // https://mvnrepository.com/artifact/org.apache.spark/spark-catalyst
    compileOnly group: 'org.apache.spark', name: 'spark-catalyst_2.11', version: '2.4.5'
    // https://mvnrepository.com/artifact/org.apache.spark/spark-tags
    compileOnly group: 'org.apache.spark', name: 'spark-tags_2.11', version: '2.4.5'
    compileOnly group: 'org.apache.spark', name: 'spark-avro_2.11', version: '2.4.5'
    // https://mvnrepository.com/artifact/org.apache.spark/spark-hive
    compileOnly group: 'org.apache.spark', name: 'spark-hive_2.11', version: '2.4.5'
    // https://mvnrepository.com/artifact/org.apache.xbean/xbean-asm7-shaded
    compile group: 'org.apache.xbean', name: 'xbean-asm7-shaded', version: '4.15'
    // https://mvnrepository.com/artifact/org.codehaus.janino/commons-compiler
    compileOnly group: 'org.codehaus.janino', name: 'commons-compiler', version: '3.0.9'
    // https://mvnrepository.com/artifact/org.codehaus.janino/janino
    compileOnly group: 'org.codehaus.janino', name: 'janino', version: '3.0.9'

    //HIVE Metastore
    compile group: 'org.postgresql', name: 'postgresql', version: '42.2.9'

    compile 'com.google.guava:guava:22.0'
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.2.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.2.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '3.2.1'
    compile group: 'com.amazonaws', name: 'aws-java-sdk-bundle', version: '1.11.271'

    compile group: 'io.delta', name: 'delta-core_2.11', version: '0.5.0'

    compile group: 'joda-time', name: 'joda-time', version: '2.10.5'
    compile group: 'org.projectlombok', name: 'lombok', version: '1.16.6'
}

shadowJar {
    zip64 true
}

Spark job:

df.distinct()
    .withColumn("date", date_format(col(EFFECTIVE_PERIOD_START), "yyyy-MM-dd"))
    .repartition(col("date"))
    .write()
    .format(fileFormat)
    .partitionBy("date")
    .mode(SaveMode.Append)
    .option("fs.s3a.committer.name", "partitioned")
    .option("fs.s3a.committer.staging.conflict-mode", "append")
    .option("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .option("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .option("compression", compressionCodecName.name().toLowerCase())
    .save(DOWNLOADS_NON_COMPACT_PATH);
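
For reference, the S3A committer settings are usually applied as session-level configuration (for example via --conf on spark-submit) rather than as per-write options. A minimal sketch of the same keys on the SparkSession builder; the app name is a placeholder, everything else is taken from the options above:

import org.apache.spark.sql.SparkSession;

// Sketch: committer settings applied on the session instead of the writer.
// Hadoop-level keys get the "spark.hadoop." prefix when set through Spark conf.
SparkSession spark = SparkSession.builder()
        .appName("DataIngestionApplication")  // placeholder name
        .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
        .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .getOrCreate();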

Executed script: spark-submit --class com.felix.DataIngestionApplication --master local DataIngestion-1.0-SNAPSHOT-all.jar

From what I understand, the Hadoop version is causing the issue: all hadoop-* JARs need to match 100% on version. So I have ensured that all org.apache.hadoop dependencies are of the same version (3.2.1), but it still gives this error.
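
One way to double-check what Gradle actually resolves, in case a transitive dependency drags in a second Hadoop version (a sketch; the configuration name assumes the legacy compile/runtime setup in the build file above):

# Print the resolved dependency tree and look for mixed org.apache.hadoop versions.
./gradlew dependencies --configuration runtime | grep 'org.apache.hadoop'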

I want to use Hadoop version 3 or newer since it provides the newer S3A committers such as the PartitionedStagingCommitter. How is everybody using this with Spark 2.4.5?

How can I force/override the Hadoop version to 3.2.1 instead of the Hadoop versions bundled in Spark's jars? When I looked at /usr/local/Cellar/apache-spark/2.4.5/libexec/jars/ I could see hadoop-common-2.7.3.jar, hadoop-client-2.7.3.jar, etc. So how do we force a newer Hadoop version so that I can leverage the new S3A committers?

Note: If I don't use spark-submit and instead run the application from IntelliJ with all dependencies set to compile, the app starts and executes without exceptions, and I can see the data getting inserted into S3.


2 Answers


I got this working by installing the Spark distribution without Hadoop (the "user-provided Hadoop" option), then installing Hadoop 3.2.1 (brew install hadoop), creating spark-env.sh from the spark-env.sh.template file, and adding the following line to spark-env.sh (/usr/local/spark-2.4.5/conf/):

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Now when I run spark-submit, the job executes without any issues.
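
A quick way to verify the wiring, assuming hadoop is on the PATH after the brew install (paths will vary per machine):

# Should report 3.2.1 rather than the 2.7.3 bundled with the stock Spark build.
hadoop version

# Should list the Hadoop 3.2.1 jars that SPARK_DIST_CLASSPATH puts on Spark's classpath.
hadoop classpath --glob | tr ':' '\n' | grep hadoop-common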

  • Hi, I tried your suggestion but it does not work for me, and I am getting disappointed. My PySpark job works well if I query HDFS, but when I try to query a Hive external table I get the error above. – Frank May 16 '20 at 22:33

You can use the force option in Gradle:

// Force these exact versions during Gradle's conflict resolution,
// overriding whatever the transitive dependency graph requests.
compile(group: 'org.apache.hadoop', name: 'hadoop-aws', version: '3.1.1') {
    force = true
}
compile(group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.1.1') {
    force = true
}
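
Equivalently, if your Gradle version deprecates the per-dependency force flag, the same pinning can be expressed once at the configuration level (a sketch using the same coordinates as above):

configurations.all {
    resolutionStrategy {
        // Pin these modules regardless of what other dependencies request.
        force 'org.apache.hadoop:hadoop-aws:3.1.1',
              'org.apache.hadoop:hadoop-common:3.1.1'
    }
}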