I am trying to use Apache Zeppelin (0.7.2, net install, running locally on a Mac) to explore data loaded from an S3 bucket. The data seems to load just fine, as the command:

val p = spark.read.textFile("s3a://sparkcookbook/person")

gives the result:

p: org.apache.spark.sql.Dataset[String] = [value: string]

However, when I try to call methods on the object p, I get an error. For example:

p.take(1)

results in:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)

My conf/zeppelin-env.sh is the same as the default, except that I have the AWS access key and secret key environment variables defined there. In the Spark interpreter in the Zeppelin notebook, I have added the following artifacts:

org.apache.hadoop:hadoop-aws:2.7.3  
com.amazonaws:aws-java-sdk:1.7.9    
com.fasterxml.jackson.core:jackson-core:2.9.0   
com.fasterxml.jackson.core:jackson-databind:2.9.0   
com.fasterxml.jackson.core:jackson-annotations:2.9.0

(I think only the first two are necessary.) The two commands above work fine in the Spark shell, just not in the Zeppelin notebook (see "How to use s3 with Apache spark 2.2 in the Spark shell" for how that was set up).
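As an aside, instead of relying on environment variables in zeppelin-env.sh, the same credentials can be set on the Hadoop configuration from a notebook paragraph. A minimal sketch, assuming the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variables are visible to the interpreter process:

// Sketch: hand the s3a connector its credentials directly
// (assumes the two environment variables below are actually set).
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))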

So it seems that there is a problem with one of the Jackson libraries. Maybe I'm using the wrong artifacts above for the Zeppelin interpreter?

UPDATE: Following the advice in the proposed answer below, I removed the Jackson jars that came with Zeppelin and replaced them with the following:

jackson-annotations-2.6.0.jar
jackson-core-2.6.7.jar
jackson-databind-2.6.7.jar

I also replaced the interpreter artifacts to match, so they are now:

org.apache.hadoop:hadoop-aws:2.7.3  
com.amazonaws:aws-java-sdk:1.7.9    
com.fasterxml.jackson.core:jackson-core:2.6.7   
com.fasterxml.jackson.core:jackson-databind:2.6.7   
com.fasterxml.jackson.core:jackson-annotations:2.6.0

Running the above commands, however, still gives the same error.

UPDATE 2: As per a comment on the proposed answer below, I removed the Jackson libraries from the list of artifacts, since they are now already in the jars/ folder; the only added artifacts are now the AWS artifacts above. I then cleaned the classpath by entering the following in the notebook (as per the instructions):

%spark.dep
z.reset()

I get a different error now:

val p = spark.read.textFile("s3a://sparkcookbook/person")
p.take(1)

p: org.apache.spark.sql.Dataset[String] = [value: string]
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
  at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
  at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
  at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
  at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:20)
  at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:37)
  at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
  at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
  at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)

UPDATE 3: As per the suggestion in a comment on the proposed answer below, I cleaned the classpath by deleting all the files in the local repo:

rm -rf local-repo/*

I then restarted the Zeppelin server. To check the classpath, I executed the following in the notebook:

val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
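(A filtered variant of the same check, as a sketch, prints just the Jackson entries:)

// Same classloader check, restricted to jars with "jackson" in the path.
cl.asInstanceOf[java.net.URLClassLoader].getURLs
  .filter(_.toString.contains("jackson"))
  .foreach(println)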

This gave the following output (I include only the Jackson libraries here; otherwise the output is too long to paste):

...
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-annotations-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-annotations-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-databind-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-databind-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-jaxrs-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-xc-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-annotations-2.6.0.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-core-2.6.7.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-databind-2.6.7.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-annotations-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-core-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-databind-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-annotations-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-core-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-databind-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-jaxrs-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-module-paranamer-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-module-scala_2.11-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-xc-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/json4s-jackson_2.11-3.2.11.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/parquet-jackson-1.8.1.jar
...

It seems that multiple versions are fetched from the repo. Should I exclude the older versions? If so, how do I do that?
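In case it helps: Zeppelin's dependency loader supports per-artifact exclusions, which looks like the natural way to drop the transitive Jackson copies. A sketch, assuming the %spark.dep interpreter is enabled (the exclude() call is Zeppelin's dynamic-dependency API; whether it resolves this particular conflict is untested here):

%spark.dep
z.reset()
// Keep the AWS artifacts but exclude their transitive jackson-databind,
// so only Spark's own 2.6.5 Jackson jars remain on the classpath.
z.load("org.apache.hadoop:hadoop-aws:2.7.3").exclude("com.fasterxml.jackson.core:jackson-databind")
z.load("com.amazonaws:aws-java-sdk:1.7.9").exclude("com.fasterxml.jackson.core:jackson-databind")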

– Shafique Jamal

2 Answers


Use these jar versions:

aws-java-sdk-1.7.4.jar

hadoop-aws-2.6.0.jar

as in this script: https://github.com/2dmitrypavlov/sparkDocker/blob/master/zeppelin.sh. Do not use the package mechanism; instead, download the jars and put them in a path, say "/root/jars/", then edit your zeppelin-env.sh by running this command from the zeppelin/conf dir:

echo 'export SPARK_SUBMIT_OPTIONS="--jars /root/jars/mysql-connector-java-5.1.39.jar,/root/jars/aws-java-sdk-1.7.4.jar,/root/jars/hadoop-aws-2.6.0.jar"'>>zeppelin-env.sh

After that, restart Zeppelin.
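Once Zeppelin is back up, a quick sanity check is to re-run the commands from the question; with this setup they should return data instead of throwing:

val p = spark.read.textFile("s3a://sparkcookbook/person")
p.take(1)  // should now return the first record rather than a NoClassDefFoundError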

The code at the link above is pasted below (just in case the link becomes stale):

#!/bin/bash
# Download jars
cd /root/jars
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.39/mysql-connector-java-5.1.39.jar

cd /usr/share/
wget http://archive.apache.org/dist/zeppelin/zeppelin-0.7.1/zeppelin-0.7.1-bin-all.tgz
tar -zxvf zeppelin-0.7.1-bin-all.tgz
cd zeppelin-0.7.1-bin-all/conf
cp zeppelin-env.sh.template zeppelin-env.sh
echo 'export MASTER=spark://'$MASTERZ':7077'>>zeppelin-env.sh
echo 'export SPARK_SUBMIT_OPTIONS="--jars /root/jars/mysql-connector-java-5.1.39.jar,/root/jars/aws-java-sdk-1.7.4.jar,/root/jars/hadoop-aws-2.6.0.jar"'>>zeppelin-env.sh
echo 'export ZEPPELIN_NOTEBOOK_STORAGE="org.apache.zeppelin.notebook.repo.VFSNotebookRepo, org.apache.zeppelin.notebook.repo.zeppelinhub.ZeppelinHubRepo"'>>zeppelin-env.sh
echo 'export ZEPPELINHUB_API_ADDRESS="https://www.zeppelinhub.com"'>>zeppelin-env.sh
echo 'export ZEPPELIN_PORT=9999'>>zeppelin-env.sh
echo 'export SPARK_HOME=/usr/share/spark'>>zeppelin-env.sh

cd ../bin/
./zeppelin.sh
– Dmitry
  • many thanks for this - it works. One important point to note is that before, I was starting Zeppelin with `bin/zeppelin-daemon start`, whereas I noticed that in the GitHub script you linked to, you start Zeppelin with `bin/zeppelin.sh`. That could have been my mistake all along (though the docs instruct using the daemon). – Shafique Jamal Aug 28 '17 at 15:54
  • yes, you can use the daemon start, it does not matter. – Dmitry Aug 28 '17 at 19:03
  • turns out it does matter. I get an error if I use the daemon, but not if I use the `zeppelin.sh`. – Shafique Jamal Aug 29 '17 at 22:09

You are probably using too recent a Jackson version. Even Spark 2.3 is still on `2.6.7`. Downgrade, and make sure that all your Jackson JARs are consistent.
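One quick way to confirm which versions actually ended up on the interpreter's classpath (a sketch; the PackageVersion classes ship inside jackson-core and jackson-databind):

// Print the resolved versions of the two core Jackson jars;
// both should report the same 2.6.x version as Spark's bundled jars.
println(com.fasterxml.jackson.core.json.PackageVersion.VERSION)
println(com.fasterxml.jackson.databind.cfg.PackageVersion.VERSION)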

– stevel
  • Thanks for responding. I tried what you suggested, and updated my question with the results (there was no difference in the error). Did I downgrade correctly? – Shafique Jamal Aug 22 '17 at 19:01
  • You may have to clean up Zeppelin's classpath, to make sure that the higher versions are gone. That should happen by itself, but you may want to verify. Also, you want to completely avoid adding those dependencies, since they are already in the classpath provided by Spark. No point in making those versions explicit; this will only lead to conflicts in the future. – Rick Moritz Aug 23 '17 at 11:46
  • @Rick Moritz: Thanks, I tried your suggestion (see updates to the question - did I clean the classpath correctly? How would I verify?). I'm now getting a different error (please see above). – Shafique Jamal Aug 23 '17 at 19:02
  • @ShafiqueJamal Check the contents of zeppelin/local-repo, and ideally clean it (unless you have non-public dependencies in there). Then check whether all the classes you need are fetched from the central repo (or included in the Spark assembly). Also check the classpath displayed in the Spark UI. – Rick Moritz Aug 25 '17 at 09:31
  • @Rick Moritz: Thanks for your help. I tried your suggestion (did I do it right? See my update to the question above). The error persists. Maybe because the older versions of the Jackson jars are still present? – Shafique Jamal Aug 25 '17 at 17:55
  • @ShafiqueJamal As you can see, you now have *both* of your direct dependencies pull in **old** versions of jackson-databind. If you can exclude those dependencies, at least Spark should work, BUT you may get errors because these pacakges rely on removed APIs. If don't know the policy for jackson, but would hope that major releases remain backwards compatible. If not, then you may have to build your own jars of those dependencies with shaded jackson dependencies. – Rick Moritz Aug 26 '17 at 09:54