
Note: This is NOT a duplicate of Getting NullPointerException when running Spark Code in Zeppelin 0.7.1


I've run into this roadblock in Apache Zeppelin on Amazon EMR. I'm trying to load a fat-jar (located on Amazon S3) into the Spark interpreter. Once the fat-jar is loaded, Zeppelin's Spark interpreter refuses to work and fails with the following stack trace:

java.lang.NullPointerException
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
    at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:398)
    at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:387)
    at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:146)
    at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:843)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Even a simple Scala statement like val str: String = "sample string", which doesn't access anything from the jar, produces the above error log. Removing the jar from the interpreter's dependencies fixes the issue, so the problem clearly lies with the jar itself.

The fat-jar in question has been generated by Jenkins using sbt assembly. The project (whose fat-jar I'm loading) contains two submodules inside a parent module.


While sharing the complete build.sbt files and dependency files of all 3 modules would be impractical, I'm enclosing an exhaustive list of all dependencies and configurations used in them.

AWS dependencies

  • "com.amazonaws" % "aws-java-sdk-s3" % "1.11.218"
  • "com.amazonaws" % "aws-java-sdk-emr" % "1.11.218"
  • "com.amazonaws" % "aws-java-sdk-ec2" % "1.11.218"

Spark dependencies (marked as provided via allSparkdependencies.map(_ % "provided"))

  • "org.apache.spark" %% "spark-core" % "2.2.0"
  • "org.apache.spark" %% "spark-sql" % "2.2.0"
  • "org.apache.spark" %% "spark-hive" % "2.2.0"
  • "org.apache.spark" %% "spark-streaming" % "2.2.0"

Testing dependencies

  • "org.scalatest" %% "scalatest" % "3.0.3" % Test
  • "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.7.2" % "test"

Other dependencies

  • "com.github.scopt" %% "scopt" % "3.7.0"
  • "com.typesafe" % "config" % "1.3.1"
  • "com.typesafe.play" %% "play-json" % "2.6.6"
  • "joda-time" % "joda-time" % "2.9.9"
  • "mysql" % "mysql-connector-java" % "5.1.41"
  • "com.github.gilbertw1" %% "slack-scala-client" % "0.2.2"
  • "org.scalaj" %% "scalaj-http" % "2.3.0"

Framework versions

  • Scala v2.11.11
  • SBT v1.0.3
  • Spark v2.2.0
  • Zeppelin v0.7.3

SBT Configurations

// cache options
offline := false
updateOptions := updateOptions.value.withCachedResolution(true)

// aggregate options
aggregate in assembly := false
aggregate in update := false

// fork options
fork in Test := true

// merge strategy
assemblyMergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case m if m.startsWith("META-INF") => MergeStrategy.discard
  case PathList("javax", "servlet", _@_*) => MergeStrategy.first
  case PathList("org", "apache", _@_*) => MergeStrategy.first
  case PathList("org", "jboss", _@_*) => MergeStrategy.first
  case "about.html" => MergeStrategy.rename
  case "reference.conf" => MergeStrategy.concat
  case "application.conf" => MergeStrategy.concat
  case _ => MergeStrategy.first
}
y2k-shubham
  • There is something in the aws-*.jar that is breaking Zeppelin and I also can't figure out what... – eliasah Jan 31 '18 at 07:49
  • Solutions mentioned [here](https://stackoverflow.com/questions/43289067/getting-nullpointerexception-when-running-spark-code-in-zeppelin-0-7-1) are either not working or are *(most-probably)* inapplicable in my scenario. – y2k-shubham Jan 31 '18 at 07:51
  • I suggest that you add this to your question so it doesn't get flagged as a dupe – eliasah Jan 31 '18 at 07:52
  • **@eliasah** I'd like to tell you that another *jar* that also contains the 2 **submodules** (that I have in this problematic jar) works just fine. And one of those submodules contains the `aws-*.jar` dependencies. So at least in my case, it's unlikely that `aws-*.jar` are the culprit. – y2k-shubham Jan 31 '18 at 07:56
  • @y2k-shubham Did you ever find out the solution? I'm facing the exact same problem. – xan Mar 23 '18 at 22:59
  • **@ss85** please do convey if the *proposed solution* worked for you and / or whatever *additional changes* you made to fix the *glitch* – y2k-shubham Mar 24 '18 at 05:34

1 Answer


Although the problem did get fixed, I was honestly unable to pin down its root cause (and hence a real solution for it). After scouring forums in vain, I ended up manually comparing (and re-aligning) my code (git diff) with the last known working build. (!)

It's been a while since then, and when I check my git history now, I find that the commit that fixed this problem contains either refactoring or build-related changes. Therefore my best guess is that it was a build-related issue. I'm listing all the changes that I made to build.sbt.

I reiterate that I cannot establish whether these particular modifications are what fixed the issue, so keep looking. I'll keep this question open until a conclusive cause (and solution) is found.


Mark the following dependencies as provided, as suggested here:

"org.apache.spark" %% "spark-core" % sparkVersion
"org.apache.spark" %% "spark-sql" % sparkVersion
"org.apache.spark" %% "spark-hive" % sparkVersion
"org.apache.spark" %% "spark-streaming" % sparkVersion
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.7.2" % "test"

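A minimal sketch of how these could be marked as provided in build.sbt follows. The name `sparkDependencies` is illustrative (the question's build calls it `allSparkdependencies`), and `sparkVersion` is assumed to be "2.2.0" as listed in the question; adapt both to the actual build.

```scala
// build.sbt fragment (sketch, not the original build file)
val sparkVersion = "2.2.0" // assumption: the Spark version listed in the question

// Spark artifacts are supplied by the cluster at runtime,
// so they are kept out of the fat-jar by scoping them as "provided".
val sparkDependencies = Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion,
  "org.apache.spark" %% "spark-sql"       % sparkVersion,
  "org.apache.spark" %% "spark-hive"      % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)

libraryDependencies ++= sparkDependencies.map(_ % "provided")

// Test-only dependency: it is not bundled by sbt assembly either.
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.7.2" % "test"
```
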
Override the following fasterxml.jackson dependencies:

dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.6.5"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.5"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.6.5"

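Since the overrides above pin Jackson to 2.6.5, one way to check which Jackson jar the JVM actually picks up is to ask the loaded classes where they came from. The sketch below is a diagnostic under stated assumptions, not part of the original fix: the `ClassOrigin` object and its `locate` helper are hypothetical names, and it relies only on JDK reflection. It could be pasted into spark-shell started with the same fat-jar on the classpath.

```scala
import scala.util.Try

object ClassOrigin {
  /** Returns the location a class was loaded from, or None if the class
    * cannot be found or has no code source (e.g. JDK bootstrap classes). */
  def locate(className: String): Option[String] =
    for {
      clazz    <- Try(Class.forName(className)).toOption
      domain   <- Option(clazz.getProtectionDomain)
      source   <- Option(domain.getCodeSource)
      location <- Option(source.getLocation)
    } yield location.toString

  def main(args: Array[String]): Unit = {
    // One representative class from each Jackson artifact overridden above.
    val probes = Seq(
      "com.fasterxml.jackson.core.JsonFactory",
      "com.fasterxml.jackson.databind.ObjectMapper",
      "com.fasterxml.jackson.module.scala.DefaultScalaModule"
    )
    probes.foreach { name =>
      println(s"$name -> ${locate(name).getOrElse("not found on classpath")}")
    }
  }
}
```

If a reported location points at the fat-jar rather than at the jars shipped with Spark, the assembly is carrying its own copy of Jackson, which would be consistent with (though not proof of) a version clash.
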
I'd like to point out one thing in particular: the following LogBack dependency, which I initially held responsible, actually had nothing to do with this (we had faced issues with LogBack in the past, so it was convenient to blame it). Although we removed it at the time of resolution, we've since added it back.

"ch.qos.logback" % "logback-classic" % "1.2.3"
y2k-shubham
  • [Zeppelin-2475](https://issues.apache.org/jira/browse/ZEPPELIN-2475) seems related to this – y2k-shubham May 09 '18 at 03:48
  • Looks like it was `jackson` which was annoying `Zeppelin` all this while; I just came across [this comment](https://issues.apache.org/jira/browse/ZEPPELIN-2475?focusedCommentId=16330414&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16330414) on **[Zeppelin-2475]** – y2k-shubham Aug 29 '18 at 05:06