
I am new to Spark and I am facing a very strange problem. I am trying to develop a Spark application that gets JSON data from a web API and puts the data into a Hive table. I've divided my program into two parts:

1.) Access.java - connects to the web API and gets the JSON
2.) test.scala - parses the JSON and writes it to a Hive table (this is where Spark comes into the picture)

Here is my code:

1.) Access.java:

import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.json.simple.JSONArray;
import org.json.simple.parser.JSONParser;

public class Access {
    JSONArray getToes(){

        CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("xxxxxx","xxxxxxx"));
        HttpClientContext localContext = HttpClientContext.create();
        localContext.setCredentialsProvider(credentialsProvider);

        HttpHost proxy = new HttpHost("xxxxxxxxxxxxxxxxxxx", 8080, "http");
        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();

        HttpClient httpClient = HttpClients.custom().setDefaultCredentialsProvider(credentialsProvider).build();

        HttpGet toesGet = new HttpGet("https://api.riskrecon.com/v0.1/toes");

        toesGet.setConfig(config);
        toesGet.setHeader("Accept","Application/Json");
        toesGet.addHeader("Authorization","Bearer xxxxxxxx");

        try {
            HttpResponse toes = httpClient.execute(toesGet);
            System.out.println(toes.getStatusLine());
            //System.out.println(toes.getAllHeaders().toString());
            System.out.println(toes.getEntity().toString());

            if(toes.getStatusLine().getStatusCode() == 200) {
                JSONParser parser = new JSONParser();
                JSONArray arr = (JSONArray) parser.parse(EntityUtils.toString(toes.getEntity()));
                System.out.println(arr);
                return arr;
            }
        } catch (Exception e){
            e.printStackTrace();
        }
        return null;
    }
}

2.) test.scala:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object test {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
    val acc = new Access
    val arr = acc.getToes()
    print(arr)

    //System.setProperty("hadoop.home.dir", "path") //For running on windows
    val conf = new SparkConf().setAppName("RiskRecon")//.setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val hiveContext = new HiveContext(sc)
    val obj = arr.get(0)
    val rdd = sc.parallelize(Seq(arr.toString))
    val dataframe = hiveContext.read.json(rdd)

    println("Size of Json "+arr.size())
    println("Size of dataframe "+dataframe.count())
    dataframe.show()
    print(dataframe.getClass)
  }
}

I'm adding all the dependencies through Maven. Here's my pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>Test_Scala_project</groupId>
  <artifactId>xxxxxxxx</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>TestJar</name>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.5.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>4.4.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.googlecode.json-simple/json-simple -->
    <dependency>
      <groupId>com.googlecode.json-simple</groupId>
      <artifactId>json-simple</artifactId>
      <version>1.1.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
</project>
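
(One thing I'm not sure about: whether the Spark dependencies should be marked as provided so they don't get bundled when I build an uber jar for the cluster. The snippet below is only a sketch of what I mean; it is not part of my current pom.xml.)

<!-- Sketch only: marking a cluster-provided Spark dependency as "provided" -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.6.0</version>
  <scope>provided</scope>
</dependency>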

Now, I am able to package the jar and run spark-submit on Windows. But I face errors when I do spark-submit on Linux. This is my command:

   spark-submit --verbose --master yarn --class test app.jar 

It gives me an error saying:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/client/protocol/HttpClientContext

Then I add the required jars and try to run it again:

spark-submit --verbose --master yarn --class test --jars httpclient-4.5.2.jar,httpcore-4.4.4.jar app.jar

Now, I get this weird error:

Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:144)
        at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:966)
        at Access.getToes(Access.java:29)
        at test$.main(test.scala:9)
        at test.main(test.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I've searched for similar errors, but nothing was in the context of Spark. I did find a couple of links that were similar:

http://apache-spark-developers-list.1001551.n3.nabble.com/Dependency-hell-in-Spark-applications-td8264.html

NoSuchMethodError while running AWS S3 client on Spark while javap shows otherwise

The error looks like a mismatched dependency issue. From what I understand, a NoSuchFieldError like this usually means a class compiled against a newer version of httpcomponents is picking up an older version of another httpcomponents class from the classpath at runtime. In the first link above, it is mentioned that Spark uses httpclient 4.1.2 internally. I have two questions regarding this:

1.) If Spark ships a default httpclient library internally, why does it throw a 'class not found' error when I run the application without adding the HTTP libraries?

2.) I've tried including the 4.1.2 versions of both httpclient and httpcore in the command and running it again. This is the command:

spark-submit --verbose --master yarn --class test --jars httpclient-4.1.2.jar,httpcore-4.1.2.jar app.jar

It again gives me the NoClassDefFoundError:

java.lang.NoClassDefFoundError: org/apache/http/client/protocol/HttpClientContext

This is so strange. I get different errors with different versions of the libraries. I tried changing the httpcomponents versions in my pom.xml as well. I also tried removing the HTTP dependencies from the pom.xml completely, since the Spark libraries include them internally (from what I've read so far), but it still gives me the same errors.
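
To see which jar the conflicting class actually gets loaded from at runtime, I can add a small diagnostic to the driver before anything else runs. This is just a sketch (getCodeSource can return null for classes loaded by the bootstrap class loader):

// Sketch: print the jar that HttpClientContext is actually loaded from at runtime
val cls = Class.forName("org.apache.http.client.protocol.HttpClientContext")
val src = cls.getProtectionDomain.getCodeSource
println(if (src != null) src.getLocation else "loaded by the bootstrap class loader")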

I've also tried packaging Access.java as a separate application and running it with the java -jar command. It runs fine without any errors. The problems arise only when the Spark libraries are involved.

I also tried packaging the app as an uber jar and running it. The same errors still show up.

What is causing this issue? Does Spark use other versions of httpcore and httpclient (other than the ones I've tried)? What would be the best solution?

Right now, I can only think of separating the application into two parts: one part that handles the JSON and saves it as a text file, and another part that is the Spark application which populates the Hive table. I don't think packaging a custom Spark build with the required versions of the httpcomponents jars will work for me, as I'll be running it on a cluster and can't change the default libraries.
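
Another option I came across while searching (it's marked experimental, and I haven't verified it on my cluster) is telling Spark to prefer the application's jars over its own on the driver and executors:

spark-submit --verbose --master yarn --class test \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars httpclient-4.5.2.jar,httpcore-4.4.4.jar app.jar

I haven't tried this yet, so I can't say whether it actually avoids the conflict.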

FIX:

I've used the maven-shade-plugin, as pointed out by cricket_007 in his comments. I added the following lines to my pom.xml:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.http</pattern>
            <shadedPattern>org.shaded.apache.http</shadedPattern>
          </relocation>
        </relocations>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
      </configuration>
    </execution>
  </executions>
</plugin>

With the relocation in place, the org.apache.http classes bundled in my jar no longer clash with the older versions that Spark puts on the classpath. The program now runs without any errors! Hope this helps someone else.
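
(As discussed in the comments below, the shaded jar then failed with java.lang.ClassNotFoundException: test until I added a Scala compiler plugin to the Maven build. A minimal sketch, assuming the commonly used scala-maven-plugin; the exact version is illustrative:)

<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>3.2.2</version>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
</plugin>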

  • You need to use the Maven shade plugin to bundle all the jar files together – OneCricketeer Jul 11 '17 at 12:17
  • Thanks for the comment. I've tried to create an uber jar and it fails with the same errors as well. Will the Maven shade plugin be any better? @cricket_007 – Hemanth Annavarapu Jul 11 '17 at 12:18
  • That plugin creates an uber/fat jar for you. – OneCricketeer Jul 11 '17 at 12:20
  • I'll need to try that... Will there be any difference between adding the jars to the classpath in the `spark-submit` and creating the uber jar? @cricket_007 – Hemanth Annavarapu Jul 11 '17 at 12:23
  • The shade plugin should correctly handle overlapping classes between all dependencies. Otherwise, you can use any other HTTP library. Also, not sure why you need Scala & Java; personally I would just pick the one you're familiar with (you're already using Java 8) – OneCricketeer Jul 11 '17 at 12:27
  • I am new to Spark and Scala, so I used Java for the API access. There should be no issues using them together anyway, right? I am only facing problems when I'm running it on Linux @cricket_007 – Hemanth Annavarapu Jul 11 '17 at 12:34
  • In my experience, you need a Scala compiler plugin for Maven, but they are compatible, yes. My point was only that you can write Spark code in Java. – OneCricketeer Jul 11 '17 at 12:40
  • Secondly, if you are "new to Spark", why are you using Spark 1.6? Is that what's on the Hadoop cluster? – OneCricketeer Jul 11 '17 at 12:42
  • Yes. That's the version I have on the cluster. Also what did you mean by using any other HTTP library? @cricket_007 – Hemanth Annavarapu Jul 11 '17 at 12:43
  • Okhttp, Async HTTP Client, play-ws, spray, etc... There's a large handful of Java (or Scala) http frameworks – OneCricketeer Jul 11 '17 at 12:45
  • I've tried using the maven-shade-plugin to build the uber jar and tried to run it with spark-submit. The same error keeps showing up. Added the edit to the question. Can you please take a look? @cricket_007 – Hemanth Annavarapu Jul 11 '17 at 14:52
  • If you look at the duplicate answer I flagged, the XML there has `org.apache.http`, you are only doing `org.apache.httpcomponents`. The error exists in `httpcore`, so you have missed that pattern – OneCricketeer Jul 11 '17 at 16:13
  • Thanks for this. I've made the change. Now the error doesn't show up. However, I get a new error: `java.lang.ClassNotFoundException: test`. What could be causing this? I didn't have this problem earlier. @cricket_007 – Hemanth Annavarapu Jul 11 '17 at 17:23
  • Like I said, you need the Maven Scala Plugin in order to compile Scala code – OneCricketeer Jul 11 '17 at 17:36
  • I've added the Maven Scala dependency. It now works! You're awesome! Thanks a lot! :) – Hemanth Annavarapu Jul 11 '17 at 19:03
