
I am using Apache Spark on my Windows machine. I am relatively new to this and am working locally before uploading my code to the cluster.

I've written a very simple Scala program and everything works fine:

// running inside spark-shell, where the SparkContext `sc` is already defined
println("creating Dataframe from json")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rawData = sqlContext.read.json("test_data.txt")
println("this is the test data table")
rawData.show()
println("finished running")

The program executes correctly. I now want to add some processing that calls a few simple Java functions I've pre-packaged in a JAR file. I'm running the Scala shell; as the getting-started page says, I start up the shell with:

c:\Users\eshalev\Desktop\spark-1.4.1-bin-hadoop2.6\bin\spark-shell --master local[4] --jars myjar-1.0-SNAPSHOT.jar

Important fact: I don't have Hadoop installed on my local machine. But since I'm only parsing a text file this shouldn't matter, and it didn't matter until I used --jars.
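For context, this is roughly how I intend to call the Java code later (just a hypothetical sketch: the class, method, and column names below, e.g. com.example.MathUtils.scale, are placeholders for my actual jar contents, and nothing like this is in the session yet):

import org.apache.spark.sql.functions.{col, udf}

// wrap one of the static Java methods from the jar in a Spark SQL UDF
// (com.example.MathUtils.scale and "someNumericColumn" are placeholder names)
val scaleUdf = udf((x: Double) => com.example.MathUtils.scale(x))
rawData.withColumn("scaled", scaleUdf(col("someNumericColumn"))).show()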

I now proceed to run the same Scala program. Nothing in it references the jar file yet... This time I get:

...some Spark debug output here, and then...
    15/09/08 14:27:37 INFO Executor: Fetching http://10.61.97.179:62752/jars/myjar-1.0-SNAPSHOT.jar with timestamp 1441715239626
    15/09/08 14:27:37 INFO Utils: Fetching http://10.61.97.179:62752/jars/myjar-1.0-SNAPSHOT.jar-1.0 to C:\Users\eshalev\AppData\Local\Temp\spark-dd9eb37f-4033-4c37-bdbf-5df309b5eace\userFiles-ebe63c02-8161-4162-9dc0-74e3df6f7356\fetchFileTemp2982091960655942774.tmp
    15/09/08 14:27:37 INFO Executor: Fetching http://10.61.97.179:62752/jars/myjar-1.0-SNAPSHOT.jar with timestamp 1441715239626
    15/09/08 14:27:37 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
    java.lang.NullPointerException
            at java.lang.ProcessBuilder.start(Unknown Source)
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
            at org.apache.hadoop.util.Shell.run(Shell.java:455)
            at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
            at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
            at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
            at org.apache.spark.util.Utils$.fetchFile(Utils.scala:465)
...plenty more Spark debug messages here, and then...
this is the test data table
<console>:20: error: not found: value rawData
              rawData.show()
              ^
finished running

I double-checked http://10.61.97.179:62752/jars/myjar-1.0-SNAPSHOT.jar-1.0-SNAPSHOT.jar, and I can download it just fine. Then again, nothing in the code references the jar yet. If I start the shell without --jars, everything works fine.

  • Is the code above in a Scala object? If so, you would want to use spark-submit instead of spark-shell – Erik Schmiegelow Sep 08 '15 at 14:33
  • No, as can be seen in the snippet, it is not an object. And as mentioned, Spark runs the Scala code without errors as long as I don't use the --jars flag when starting Spark. (I am not trying to call the Java code *yet*.) – eshalev Sep 09 '15 at 06:53
  • I was wondering, because you wrote "I've written a very simple Scala program and everything works fine." What is in that jar? Can you provide more specific information? – Erik Schmiegelow Sep 09 '15 at 08:51
  • The jar is built from a Maven project. It contains some static, pure-Java mathematical methods, which I want to call to transform "rawData" in the Scala code above. The jar does NOT include the Scala program and has no Spark dependencies. The jar code is not yet invoked: at the moment, I just launch spark-shell with "--jars" and load the simple code listed above. That *exact* Scala code, which works when I don't use "--jars", stops working when "--jars" is specified. Once, when I misspelled the jar file path, it worked by accident because the "--jars wrong_jar.jar" was ignored. – eshalev Sep 09 '15 at 10:04

1 Answer


I tried this on another cluster, which runs Spark 1.3.1 and has Hadoop installed. It worked flawlessly.

The number of times Hadoop is mentioned in the stack trace on my single-node setup leads me to believe that an actual Hadoop installation is required to use the --jars flag.

The other possibility is a problem with my Spark 1.4 setup, which had worked flawlessly until then.
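If installing full Hadoop is not an option, a possible workaround (an untested sketch; I'm assuming the NullPointerException comes from Hadoop's Shell/chmod utilities needing the Windows winutils.exe binary, and C:\hadoop is a placeholder path) is to point Hadoop's home directory at a folder containing bin\winutils.exe before any job runs:

// hypothetical workaround, set at the very start of the spark-shell session:
// tell Hadoop's Shell utilities where to find winutils.exe on Windows
// (C:\hadoop is a placeholder; winutils.exe must sit under C:\hadoop\bin)
System.setProperty("hadoop.home.dir", "C:\\hadoop")

Setting the HADOOP_HOME environment variable to the same folder before launching spark-shell should have the same effect.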
