
I have not faced this problem with any other software on my system. I am able to install and run everything in Windows Terminal/Command Prompt and in Git Bash.

Recently, I started learning Spark. I installed Spark and set up everything: JAVA_HOME, SCALA_HOME, and the Hadoop winutils file. spark-shell and pyspark both run perfectly in Command Prompt/Windows Terminal and in Jupyter through the pyspark library.

spark-3.0.1-bin-hadoop2.7
python 3.8.3
Windows 10 
git version 2.29.2.windows.2

But I cannot figure it out for Git Bash (I also tried with admin permissions). I get this error when I try to run spark-shell or pyspark:

Error: Could not find or load main class org.apache.spark.launcher.Main
/c/Spark/spark-3.0.1-bin-hadoop2.7/bin/spark-class: line 96: CMD: bad array subscript

I searched for solutions and found suggestions to set environment variables in .bashrc or spark-env.sh. I set up the following for the pyspark shell:

   export JAVA_HOME='/c/Program Files/Java/jdk1.8.0_111'
   export SPARK_HOME='/c/Spark/spark-3.0.1-bin-hadoop2.7'
   export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
   export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
   export PYSPARK_PYTHON='C:/Users/raman/anaconda3/python'
   export PYSPARK_DRIVER_PYTHON='C:/Users/raman/anaconda3/python'

It didn't work either. If I trace the error back in the spark-class file, it comes from line 96, where the CMD array built by the launcher is indexed.
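For context, here is a paraphrased sketch of what that part of spark-class does (not the literal file contents; the real script may differ slightly between Spark versions). It collects the command emitted by org.apache.spark.launcher.Main into a bash array named CMD, so when that launcher class cannot be loaded, CMD stays empty and indexing it fails with "bad array subscript":

    # paraphrased sketch of the spark-class launcher logic, not the exact file
    CMD=()
    while IFS= read -d '' -r ARG; do     # read the command emitted by the Java launcher
      CMD+=("$ARG")
    done < <("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
    COUNT=${#CMD[@]}
    LAST=$((COUNT - 1))
    LAUNCHER_EXIT_CODE=${CMD[$LAST]}     # around line 96: empty CMD -> "bad array subscript"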

My questions:

  1. What is the reason for this error? How can I resolve it?
  2. Are there any well-defined steps to set up spark-shell in Git Bash on Windows? (I was not able to find anything solid on the net.)

Thanks.

BeginnerRP

3 Answers


Try running spark-shell.cmd specifically from Git Bash, e.g. $SPARK_HOME/bin/spark-shell.cmd. My guess is that when you invoke spark-shell from the Windows terminal it automatically launches spark-shell.cmd, which is why the command works there.
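If that works, a small convenience sketch (assuming SPARK_HOME is exported in your .bashrc as in the question) is to alias the usual command names to the .cmd launchers so Git Bash picks them up:

    # hypothetical ~/.bashrc aliases; assumes SPARK_HOME points at the Spark install
    alias spark-shell='"$SPARK_HOME"/bin/spark-shell.cmd'
    alias pyspark='"$SPARK_HOME"/bin/pyspark.cmd'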


I encountered the same issue. After investigation, the root cause is that the classpath passed to the java command in Git Bash is not recognized.

E.g., the command below won't work in Git Bash, because java receives the literal POSIX-style path /d/spark/jars/* as its classpath, which cannot be found on Windows.

java -cp '/d/spark/jars/*' '-Dscala.usejavacp=true' -Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name 'Spark shell' spark-shell

Error: Could not find or load main class org.apache.spark.launcher.Main

After I change it to this, it works:

java -cp 'D:\spark\jars\*' '-Dscala.usejavacp=true' -Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name 'Spark shell' spark-shell
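If you would rather keep a POSIX-style path in your scripts, one option (a sketch, assuming the cygpath utility that ships with Git for Windows is available) is to convert the path before handing it to java:

    # sketch: translate the POSIX path into a Windows path that java understands
    SPARK_JARS_WIN="$(cygpath -w /d/spark/jars)"    # -> D:\spark\jars
    java -cp "$SPARK_JARS_WIN\\*" '-Dscala.usejavacp=true' -Xmx1g \
      org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name 'Spark shell' spark-shell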
Yuwei Yang

As mentioned here, this depends on the java -cp classpath parameter used by the script when launching Spark.

If said script starts with a #!/bin/sh or #!/bin/bash, add a -x to it (for instance: #!/bin/bash -x)

That will force the script to display every line it executes, and you can see more about what ends up in ${#CMD[@]}.
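Equivalently (a usage sketch that avoids editing the file, and assumes SPARK_HOME is set as in the question), you can trace the launcher script directly:

    # sketch: run the launcher under bash -x instead of changing its shebang
    bash -x "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.SparkSubmit \
      --class org.apache.spark.repl.Main --name 'Spark shell' 2>&1 | less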

VonC
  • Thanks @VonC. It has shown this error: `Error: Could not find or load main class org.apache.spark.launcher.Main.` I have checked the jars folder; there are 246 jar files in total (many subfiles). I deleted the Spark package and installed it again, but it is still the same. Are there any other ways to debug this? I am thinking of trying to run with the latest version of Spark 2.0. – BeginnerRP Dec 30 '20 at 02:22
  • @BeginnerRP The idea is to see what exact classpath is considered by the script when it is executed. – VonC Dec 30 '20 at 02:30