6

I am once again stuck on running a packaged Spark app jar with spark-submit. Here is my pom file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <parent>
        <artifactId>oneview-forecaster</artifactId>
        <groupId>com.dataxu.oneview.forecast</groupId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>forecaster</artifactId>

<dependencies>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.module</groupId>
        <artifactId>jackson-module-scala_${scala.binary.version}</artifactId>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.2.0</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>2.8.3</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk</artifactId>
        <version>1.10.60</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/joda-time/joda-time -->
    <dependency>
        <groupId>joda-time</groupId>
        <artifactId>joda-time</artifactId>
        <version>2.9.9</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.8.0</version>
        <!--<scope>provided</scope>-->
    </dependency>
</dependencies>

<build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>${scala-maven-plugin.version}</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>com.dataxu.oneview.forecaster.App</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
</project>

The following simple snippet fetches data from an S3 location and parses it into a map:

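import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

def getS3Data(path: String): Map[String, Any] = {
    println("spark session start.........")
    // getSparkSession() is a helper defined elsewhere in the app (not shown)
    val spark = getSparkSession()

    // Pull every line of the file to the driver and concatenate into one JSON string
    val configTxt = spark.sparkContext.textFile(path)
        .collect().reduce(_ + _)

    // DefaultScalaModule lets Jackson deserialize into Scala collections
    val mapper = new ObjectMapper
    mapper.registerModule(DefaultScalaModule)
    mapper.readValue(configTxt, classOf[Map[String, Any]])
}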

When I run it from IntelliJ, everything works fine and the log is clean. However, when I package it with mvn package and run it with spark-submit, I get the following error at .collect().reduce(_ + _):

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V
    at org.apache.hadoop.fs.s3a.S3AFileSystem.addDeprecatedKeys(S3AFileSystem.java:181)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.<clinit>(S3AFileSystem.java:185)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    ...

I don't understand which dependency was not packaged, or what else might be wrong: I set the versions explicitly and expected hadoop-aws to bring in everything it needs.

Any help will be appreciated.

Omkar

3 Answers

7

The dependencies between Hadoop and the AWS SDK are very sensitive, so you should use exactly the SDK version that your Hadoop dependency was built with.

The first problem you need to solve is to pick one version of Hadoop. I see you're mixing versions 2.8.3 (hadoop-aws) and 2.8.0 (hadoop-common).

When I look at the dependency tree for org.apache.hadoop:hadoop-aws:2.8.0, I see that it is built against version 1.10.6 of the AWS SDK (same for hadoop-aws:2.8.3), whereas your POM declares aws-java-sdk 1.10.60.

maven dependency tree

This is probably what's causing mismatches (you're mixing incompatible versions). So:

  • Choose the version of hadoop you want to use
  • Include hadoop-aws with the version compatible with your hadoop
  • Remove other dependencies, or only include them with versions matching the one compatible with your hadoop version.
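
As a rough sketch of the steps above, the Hadoop dependencies could be pinned through a single property so that they cannot drift apart. The property name `hadoop.version` and the value 2.8.3 are assumptions for illustration (neither appears in the question's POM). The explicit `aws-java-sdk` dependency is deliberately left out so that the SDK version comes transitively from `hadoop-aws`.

```xml
<properties>
    <!-- Assumed property name and value: use the Hadoop version you actually deploy against -->
    <hadoop.version>2.8.3</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <!-- Same version as hadoop-common; the matching AWS SDK is pulled in transitively -->
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
```

Running `mvn dependency:tree -Dincludes=org.apache.hadoop,com.amazonaws` afterwards should make it easy to check that only one Hadoop version remains and which SDK artifacts `hadoop-aws` brings in. Under spark-submit the jars bundled with the Spark installation are on the classpath too, so the chosen version may also need to match those.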
ernest_k
  • I am trying it, will get back asap – Omkar Mar 07 '18 at 18:40
  • I removed the hadoop-common dependency and changed the hadoop version to 2.8.0 and aws-java-sdk to 1.10.6. I am now getting another error, which I am investigating: `Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation` – Omkar Mar 07 '18 at 18:52
  • Can you update the question with that info as well as with the updated version of your pom file? – ernest_k Mar 07 '18 at 18:54
  • It's the hadoop-aws and hadoop-core libs which aren't in sync; they both need to match: 2.8.0, 2.8.3, whatever. Same for the versions of Jackson and Spark itself. They work well, but only if you use the exact same numbers – stevel Mar 07 '18 at 19:16
3

In case anybody else is still stumbling on this error: it took me a while to find out, but check whether your project has a dependency (direct or transitive) on org.apache.avro/avro-tools. It was brought into my code by a transitive dependency. The problem is that it ships with a copy of org.apache.hadoop.conf.Configuration that is much older than all current versions of Hadoop, and that copy may end up being the one picked up from the classpath.

In my scala project, I just had to exclude it with

 ExclusionRule("org.apache.avro","avro-tools")

and the error (finally!) disappeared.

I am sure the avro-tools developers had a good reason to include a copy of a file that belongs to another package (hadoop-common), but I was really surprised to find it there, and it made me waste an entire day.
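
Since the question builds with Maven rather than sbt, the same exclusion might look roughly like this in a POM. The outer dependency coordinates below are hypothetical placeholders: the exclusion has to sit on whichever dependency actually pulls in avro-tools, which `mvn dependency:tree -Dincludes=org.apache.avro:avro-tools` can help locate.

```xml
<dependency>
    <!-- Hypothetical coordinates: replace with the dependency that drags in avro-tools -->
    <groupId>com.example</groupId>
    <artifactId>some-library</artifactId>
    <version>1.0.0</version>
    <exclusions>
        <!-- Keeps avro-tools (and its bundled copy of Configuration) out of the assembled jar -->
        <exclusion>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-tools</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```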

Roberto Congiu
0

In my case, I was running a local Spark installation on a Cloudera edge node and was hitting this conflict (even though I made sure to download Spark with the correct hadoop binaries precompiled). I just went into my Spark home and moved the hadoop-common jar so it wouldn't be loaded:

mv ~/spark-2.4.4-bin-hadoop2.6/jars/hadoop-common-2.6.5.jar ~/spark-2.4.4-bin-hadoop2.6/jars/hadoop-common-2.6.5.jar.XXXXXX

After that, it ran... in local mode anyway.

Clark Updike