I am new to Spark and I am facing a very strange problem. I am trying to develop a Spark application that gets JSON data from a web API and puts the data into a Hive table. I've divided my program into two parts:
1.) Access.java - connects to the web API and fetches the JSON
2.) test.scala - parses the JSON and writes it to a Hive table (this is where Spark comes into the picture)
Here is my code:
1.) Access.java:
import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.json.simple.JSONArray;
import org.json.simple.parser.JSONParser;

public class Access {

    // Calls the RiskRecon "toes" endpoint and returns the parsed JSON array,
    // or null if the request fails.
    JSONArray getToes() {
        CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("xxxxxx", "xxxxxxx"));
        // localContext is currently not passed to execute(); the credentials are applied
        // through setDefaultCredentialsProvider on the client builder below.
        HttpClientContext localContext = HttpClientContext.create();
        localContext.setCredentialsProvider(credentialsProvider);
        HttpHost proxy = new HttpHost("xxxxxxxxxxxxxxxxxxx", 8080, "http");
        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        HttpClient httpClient = HttpClients.custom().setDefaultCredentialsProvider(credentialsProvider).build();

        HttpGet toesGet = new HttpGet("https://api.riskrecon.com/v0.1/toes");
        toesGet.setConfig(config);
        toesGet.setHeader("Accept", "Application/Json");
        toesGet.addHeader("Authorization", "Bearer xxxxxxxx");

        try {
            HttpResponse toes = httpClient.execute(toesGet);
            System.out.println(toes.getStatusLine());
            //System.out.println(toes.getAllHeaders().toString());
            System.out.println(toes.getEntity().toString());
            if (toes.getStatusLine().getStatusCode() == 200) {
                JSONParser parser = new JSONParser();
                JSONArray arr = (JSONArray) parser.parse(EntityUtils.toString(toes.getEntity()));
                System.out.println(arr);
                return arr;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
2.) test.scala:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object test {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
    val acc = new Access
    val arr = acc.getToes()
    print(arr)
    //System.setProperty("hadoop.home.dir", "path") // for running on Windows
    val conf = new SparkConf().setAppName("RiskRecon") //.setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val hiveContext = new HiveContext(sc)
    val obj = arr.get(0)
    // Parallelize the raw JSON string and let Spark infer the schema.
    val rdd = sc.parallelize(Seq(arr.toString))
    val dataframe = hiveContext.read.json(rdd)
    println("Size of Json " + arr.size())
    println("Size of dataframe " + dataframe.count())
    dataframe.show()
    print(dataframe.getClass)
  }
}
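For reference, the code above stops at inspecting the DataFrame; the actual Hive write described at the top isn't shown yet. Once the dependency issue is sorted out, I plan to add something roughly like the following (the table name riskrecon_toes is just a placeholder):

import org.apache.spark.sql.SaveMode

// Persist the parsed JSON into Hive via the HiveContext-backed DataFrame.
// "riskrecon_toes" is only a placeholder table name.
dataframe.write
  .mode(SaveMode.Overwrite)
  .saveAsTable("riskrecon_toes")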
I'm adding all the dependencies through Maven. Here's my pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>Test_Scala_project</groupId>
  <artifactId>xxxxxxxx</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>TestJar</name>

  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.5.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>4.4.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.googlecode.json-simple/json-simple -->
    <dependency>
      <groupId>com.googlecode.json-simple</groupId>
      <artifactId>json-simple</artifactId>
      <version>1.1.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
</project>
Now, I am able to package the jar and run spark-submit on Windows, but I face errors when I do spark-submit on Linux. This is my command:
spark-submit --verbose --master yarn --class test app.jar
It gives me an error saying:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/client/protocol/HttpClientContext
Then I add the required jars and try to run it again:
spark-submit --verbose --master yarn --class test --jars httpclient-4.5.2.jar,httpcore-4.4.4.jar app.jar
Now, I get this weird error:
Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:144)
at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:966)
at Access.getToes(Access.java:29)
at test$.main(test.scala:9)
at test.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've searched for similar errors, but none of them were in the context of Spark. I did find a couple of links that were similar:
NoSuchMethodError while running AWS S3 client on Spark while javap shows otherwise
The error looks like a mismatched dependency issue. In the first link above, it is mentioned that Spark uses httpclient 4.1.2 internally. I have two questions regarding this:
1.) If Spark already ships a default httpclient library, why does it throw a NoClassDefFoundError when I run the application without adding the HTTP libraries?
2.) I've tried including the 4.1.2 versions of both httpclient and httpcore in the command and running it again. This is the command:
spark-submit --verbose --master yarn --class test --jars httpclient-4.1.2.jar,httpcore-4.1.2.jar app.jar
It again gives me the NoClassDefFoundError:
java.lang.NoClassDefFoundError: org/apache/http/client/protocol/HttpClientContext
This is so strange. It gives me different errors with different versions of the libraries. I tried changing the httpclient/httpcore versions in my pom.xml as well. I also tried removing the HTTP dependencies from the pom.xml completely, since the Spark libraries supposedly include them internally (from what I've learned so far), but it still gives me the same errors.
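To at least see which copy of httpclient actually wins at runtime on the cluster, I've been thinking of adding a small diagnostic like this to test.scala before the HTTP call (just for debugging, not a fix):

// Print the jar that HttpClientContext was actually loaded from at runtime.
// getCodeSource can be null for classes on the boot classpath, so guard for that.
val src = classOf[org.apache.http.client.protocol.HttpClientContext]
  .getProtectionDomain.getCodeSource
println(if (src != null) src.getLocation else "loaded from boot classpath")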
I've also tried packaging Access.java as a separate application and running it with the java -jar command. It runs fine without any errors. The problems arise only when the Spark libraries are involved.
I also tried packaging the app as an uber jar and running it. The same errors still show up.
What is causing this issue? Does Spark use other versions of httpcore and httpclient (other than the ones I've tried)? What would be the best solution?
Right now, I can only think of separating the application into two: one part handles the JSON and saves it as a text file, and the other part is the Spark application that populates the Hive table. I don't think packaging a custom Spark build with the required versions of the HTTP components will work for me, as I'll be running it on a cluster and can't change the default libraries.
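The only other idea I have (and I'm not sure it's advisable on a shared cluster) is to tell Spark to prefer my jars over its own with the experimental userClassPathFirst settings, something like:
spark-submit --verbose --master yarn --class test --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true --jars httpclient-4.5.2.jar,httpcore-4.4.4.jar app.jar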
FIX:
I used the maven-shade-plugin, as pointed out by cricket_007 in his comment. I added the following lines to my pom.xml:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.http</pattern>
            <shadedPattern>org.shaded.apache.http</shadedPattern>
          </relocation>
        </relocations>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
      </configuration>
    </execution>
  </executions>
</plugin>
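Since shadedArtifactAttached is set to true with the shaded classifier, mvn package now attaches an extra jar ending in -shaded.jar under target/ (the exact name depends on the artifactId and version), and that is the jar I submit. The relocation rewrites the org.apache.http references in my classes to org.shaded.apache.http, so the bundled httpclient/httpcore no longer clash with the older versions Spark ships. The submit command stays the same, just pointing at the shaded jar, roughly:
spark-submit --verbose --master yarn --class test app-shaded.jar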
The program now runs without any errors! Hope this helps someone else.