Note: This is NOT a duplicate of Getting NullPointerException when running Spark Code in Zeppelin 0.7.1
I've run into this roadblock in Apache Zeppelin on Amazon EMR. I'm trying to load a fat-jar (located on Amazon S3) into the Spark interpreter. Once the fat-jar is loaded, Zeppelin's Spark interpreter refuses to work, with the following stack-trace:
java.lang.NullPointerException
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
    at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:398)
    at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:387)
    at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:146)
    at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:843)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Even a simple Scala statement like val str: String = "sample string" that doesn't touch anything in the jar produces the error-log above. Removing the jar from the interpreter's dependencies fixes the issue, so the problem is clearly tied to the jar itself.
The fat-jar in question is generated by Jenkins using sbt assembly. The project (whose fat-jar I'm loading) contains two submodules under a parent module. Since sharing the complete build.sbt and dependency files of all three modules would be impractical, I'm enclosing an exhaustive list of all dependencies and configurations used in the submodules.
AWS dependencies
"com.amazonaws" % "aws-java-sdk-s3" % "1.11.218"
"com.amazonaws" % "aws-java-sdk-emr" % "1.11.218"
"com.amazonaws" % "aws-java-sdk-ec2" % "1.11.218"
Spark dependencies (marked as provided via allSparkdependencies.map(_ % "provided"))
"org.apache.spark" %% "spark-core" % "2.2.0"
"org.apache.spark" %% "spark-sql" % "2.2.0"
"org.apache.spark" %% "spark-hive" % "2.2.0"
"org.apache.spark" %% "spark-streaming" % "2.2.0"
Testing dependencies
"org.scalatest" %% "scalatest" % "3.0.3" % Test
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.7.2" % "test"
Other dependencies
"com.github.scopt" %% "scopt" % "3.7.0"
"com.typesafe" % "config" % "1.3.1"
"com.typesafe.play" %% "play-json" % "2.6.6"
"joda-time" % "joda-time" % "2.9.9"
"mysql" % "mysql-connector-java" % "5.1.41"
"com.github.gilbertw1" %% "slack-scala-client" % "0.2.2"
"org.scalaj" %% "scalaj-http" % "2.3.0"
Framework versions
Scala v2.11.11
SBT v1.0.3
Spark v2.2.0
Zeppelin v0.7.3
SBT Configurations
// cache options
offline := false
updateOptions := updateOptions.value.withCachedResolution(true)
// aggregate options
aggregate in assembly := false
aggregate in update := false
// fork options
fork in Test := true
// merge strategy
assemblyMergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.startsWith("META-INF") => MergeStrategy.discard
case PathList("javax", "servlet", _@_*) => MergeStrategy.first
case PathList("org", "apache", _@_*) => MergeStrategy.first
case PathList("org", "jboss", _@_*) => MergeStrategy.first
case "about.html" => MergeStrategy.rename
case "reference.conf" => MergeStrategy.concat
case "application.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
}
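To show how the pieces above fit together, here is a condensed sketch of what one submodule's build.sbt looks like. This is illustrative, not the actual build definition: the val names (awsDeps, sparkDeps, testDeps, otherDeps) are made up for grouping, but the coordinates and versions are exactly the ones listed above.

```scala
// Condensed, illustrative build.sbt for one submodule.
// Val names are hypothetical; artifacts/versions match the lists above.
scalaVersion := "2.11.11"

val sparkVersion = "2.2.0"

val awsDeps = Seq(
  "com.amazonaws" % "aws-java-sdk-s3"  % "1.11.218",
  "com.amazonaws" % "aws-java-sdk-emr" % "1.11.218",
  "com.amazonaws" % "aws-java-sdk-ec2" % "1.11.218"
)

// Spark artifacts are marked "provided" so sbt-assembly excludes them
// from the fat-jar (the cluster supplies them at runtime)
val sparkDeps = Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion,
  "org.apache.spark" %% "spark-sql"       % sparkVersion,
  "org.apache.spark" %% "spark-hive"      % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
).map(_ % "provided")

val testDeps = Seq(
  "org.scalatest"   %% "scalatest"          % "3.0.3"       % Test,
  "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.7.2" % "test"
)

val otherDeps = Seq(
  "com.github.scopt"     %% "scopt"                % "3.7.0",
  "com.typesafe"          % "config"               % "1.3.1",
  "com.typesafe.play"    %% "play-json"            % "2.6.6",
  "joda-time"             % "joda-time"            % "2.9.9",
  "mysql"                 % "mysql-connector-java" % "5.1.41",
  "com.github.gilbertw1" %% "slack-scala-client"   % "0.2.2",
  "org.scalaj"           %% "scalaj-http"          % "2.3.0"
)

libraryDependencies ++= awsDeps ++ sparkDeps ++ testDeps ++ otherDeps
```

The assembly settings and merge strategy shown earlier sit alongside these in the same file.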