[Note: Although this question has no answer, don't just pass by. Sebastian Piu's comments are helpful.]

I've installed Zeppelin-0.6.2-bin-all on Cloudera CDH 5.7.1 with Spark 1.6.0.

I've set these environment variables both in conf/zeppelin-env.sh and ~/.bashrc

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera/
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark/
export ZEPPELIN_HOME=/var/lib/zeppelin-0.6.2/

In a newly created notebook, the first paragraph with the following command runs fine:

sc.parallelize(Seq(1,2,3,4,5))

The result is:

res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:30

But the second paragraph, which just converts that RDD into a DataFrame, fails with a NullPointerException:

sc.parallelize(Seq(1,2,3,4,5)).toDF("number")

The stack trace is:

java.lang.NullPointerException
at org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:554)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:553)
at org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:540)
at org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:539)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:539)
at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:252)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:239)
at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:459)
at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:459)
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:458)
at org.apache.spark.sql.hive.HiveContext$$anon$3.<init>(HiveContext.scala:475)
at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:475)
at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:474)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.internalCreateDataFrame(SQLContext.scala:532)
at org.apache.spark.sql.SQLImplicits.intRddToDataFrameHolder(SQLImplicits.scala:185)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
at $iwC$$iwC$$iwC.<init>(<console>:43)
at $iwC$$iwC.<init>(<console>:45)
at $iwC.<init>(<console>:47)
at <init>(<console>:49)
at .<init>(<console>:53)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1326)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:821)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:800)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:953)
at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:1168)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1111)
at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1104)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I've verified that the same command works in spark-shell:

scala> sc.parallelize(Seq(1,2,3,4,5)).toDF("number")
res0: org.apache.spark.sql.DataFrame = [number: int]

I've read the documentation and thought that installing the Spark interpreter built with Scala 2.10 might help.

So I've entered these commands, which also failed:

# bin/zeppelin-daemon.sh stop
# mv interpreter/spark interpreter/spark.bak
# bin/install-interpreter.sh --name spark --artifact org.apache.zeppelin:zeppelin-spark_2.10:0.6.2

The output is:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/var/lib/zeppelin-0.6.2/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/var/lib/zeppelin-0.6.2/lib/zeppelin-interpreter-0.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Install spark(org.apache.zeppelin:zeppelin-spark_2.10:0.6.2) to /var/lib/zeppelin-0.6.2/interpreter/spark ...
Exception in thread "main" java.lang.NullPointerException
    at org.sonatype.aether.impl.internal.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:352)
    at org.apache.zeppelin.dep.DependencyResolver.getArtifactsWithDep(DependencyResolver.java:176)
    at org.apache.zeppelin.dep.DependencyResolver.loadFromMvn(DependencyResolver.java:129)
    at org.apache.zeppelin.dep.DependencyResolver.load(DependencyResolver.java:77)
    at org.apache.zeppelin.dep.DependencyResolver.load(DependencyResolver.java:94)
    at org.apache.zeppelin.dep.DependencyResolver.load(DependencyResolver.java:86)
    at org.apache.zeppelin.interpreter.install.InstallInterpreter.install(InstallInterpreter.java:170)
    at org.apache.zeppelin.interpreter.install.InstallInterpreter.install(InstallInterpreter.java:150)
    at org.apache.zeppelin.interpreter.install.InstallInterpreter.main(InstallInterpreter.java:275)

(Note: the system has Maven 3.3.9 installed, and the mvn command works.)

I suspected a wrong Java version might be the cause, so I tried

export JAVA_HOME=/usr/java/jdk1.8.0_05

and

export JAVA_HOME=/usr/java/jdk1.6.0_45

but both produced the same error.

I've also tried downgrading to Zeppelin-0.6.1-bin-all but got the same results.

These results appear to be reproducible: I repeated the same steps on another CDH 5.7.1 cluster with Spark 1.6.0 and got identical errors.

How can I make it work?

John Lin
  • We are experiencing a similar issue, I haven't tried building from source due to lack of time but one thing I found out is that if the first invocation is on the HiveContext then it seems to work just fine. You can also try creating a new hive context and then doing a query on it instead, and it will work. – Sebastian Piu Jan 21 '17 at 15:27
  • Thanks to your comment, the following code works around it: import org.apache.spark.sql.hive._ val dummy = new HiveContext(sc) sc.parallelize(Seq(1,2,3,4,5)).toDF("number") (see the formatted sketch after these comments). Here the dummy HiveContext is just created and not used at all. I saw [this answer](http://stackoverflow.com/questions/33666545/what-is-the-difference-between-apache-spark-sqlcontext-vs-hivecontext) mention that "HiveContext is required to start Thrift server." – John Lin Jan 23 '17 at 06:27
  • I am trying to build from source. However, I am behind a firewall and have URL restrictions, so it would take more time. Besides, we are in a Chinese New Year 11-day vacation. Will update if I succeed. – John Lin Jan 26 '17 at 09:00
  • Same here, and needs to be built using Linux so I need to find the time to do it – Sebastian Piu Jan 26 '17 at 09:05
  • I have successfully built from source. The problem remains the same. – John Lin Feb 07 '17 at 08:20
  • I have upgraded to zeppelin-0.7.0-bin-all. The problem remains the same. I have tried replacing zeppelin-spark_2.11-0.7.0.jar zeppelin-spark-dependencies_2.11-0.7.0.jar scala-compiler-2.11.7.jar scala-library-2.11.7.jar scala-reflect-2.11.7.jar with respective 2.10.5 jars from maven repository. But still can't solve it. – John Lin Feb 13 '17 at 07:27
  • Looks like there is a related bug in Spark 1.6.0, https://github.com/apache/spark/pull/14816 ? – Sebastian Piu Feb 13 '17 at 07:56
  • I am new in StackOverflow. In this case that this question won't have an answer, shall I leave it open? Or shall I do something to close it? – John Lin Feb 13 '17 at 08:21
  • As long as it's on topic leave it open, if it gets too chatty we can use a chat room though – Sebastian Piu Feb 13 '17 at 08:24
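
For readability, here is the workaround from the comment above, written out as a single Zeppelin paragraph. This is only a sketch of the comment's suggestion: it relies on the implicit conversions the Zeppelin Spark interpreter already imports (the same ones used by the failing paragraph earlier), and the HiveContext instance is never used.

import org.apache.spark.sql.hive._

// Instantiating a HiveContext before the first toDF call works around the
// NullPointerException; the dummy instance is never used afterwards.
val dummy = new HiveContext(sc)
sc.parallelize(Seq(1,2,3,4,5)).toDF("number")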

0 Answers