20

If I start up pyspark and then run this command:

import my_script; spark = my_script.Sparker(sc); spark.collapse('./data/')

Everything is A-ok. If, however, I try to do the same thing through the command line with spark-submit, I get an error:

Command: /usr/local/spark/bin/spark-submit my_script.py collapse ./data/
  File "/usr/local/spark/python/pyspark/rdd.py", line 352, in func
    return f(iterator)
  File "/usr/local/spark/python/pyspark/rdd.py", line 1576, in combineLocally
    merger.mergeValues(iterator)
  File "/usr/local/spark/python/pyspark/shuffle.py", line 245, in mergeValues
    for k, v in iterator:
  File "/.../my_script.py", line 173, in _json_args_to_arr
    js = cls._json(line)
RuntimeError: uninitialized staticmethod object

my_script.py:

...
if __name__ == "__main__":
    args = sys.argv[1:]
    if args[0] == 'collapse':
        directory = args[1]
        from pyspark import SparkContext
        sc = SparkContext(appName="Collapse")
        spark = Sparker(sc)
        spark.collapse(directory)
        sc.stop()

Why is this happening? What's the difference between running pyspark and running spark-submit that would cause this divergence? And how can I make this work in spark-submit?

EDIT: I tried running this from the bash shell by doing pyspark my_script.py collapse ./data/ and I got the same error. The only time when everything works is when I am in a python shell and import the script.

tamjd1
user592419
  • Here you will find a better explanation: https://stackoverflow.com/questions/33234501/spark-submit-spark-shell-difference-between-yarn-client-and-yarn-cluster-mod – GANESH CHOKHARE Aug 02 '19 at 09:19

3 Answers

23
  1. If you have built a Spark application, you need to use spark-submit to run it (a minimal sketch follows this list)

    • The code can be written in either Python or Scala

    • The deploy mode can be either local or cluster

  2. If you just want to test or run a few individual commands, you can use the shells provided by Spark

    • pyspark (for Spark in Python)
    • spark-shell (for Spark in Scala)
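
For illustration, here is a minimal sketch of a self-contained script laid out for spark-submit (the file name and the word-count logic are only examples, not the asker's code); unlike the interactive shell, it has to create its own SparkContext:

# word_count.py -- run with: spark-submit word_count.py ./data/
import sys
from pyspark import SparkContext

def count_words(sc, path):
    # classic word count: split each line into words and sum the occurrences
    return (sc.textFile(path)
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b)
              .collect())

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")  # spark-submit does not predefine sc
    for word, count in count_words(sc, sys.argv[1]):
        print(word, count)
    sc.stop()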
avrsanjay
  • pyspark only supports cluster mode with a YARN deployment. Mesos and standalone don't support cluster mode. – Ofer Eliassaf Sep 27 '16 at 17:13
  • The pyspark documentation (https://spark.apache.org/docs/0.9.0/python-programming-guide.html) says that it supports standalone too. Please correct me if I am missing something here. – avrsanjay Sep 27 '16 at 18:19
  • pyspark supports standalone in the so-called "local mode", which means the driver runs on the machine that submits the job. Only YARN supports cluster mode for pyspark, unfortunately. – Ofer Eliassaf Sep 28 '16 at 12:49
  • Oh I see, got it. Cheers mate! – avrsanjay Sep 28 '16 at 15:52
  • Honestly, it's a little dirty on the part of Databricks to decide on such non-homogeneous names. They could have named them spark-shell-python, spark-shell-scala and spark-shell-r! We have the same confusing problem with pyspark-shell and sparkr-shell when configuring Jupyter kernels. Lots of developers have lost a lot of time over this nonsense... – prossblad Nov 06 '19 at 16:33
  • But I guess it was decided in advance by Databricks in order to stay competitive with their cloud solution... In a word, their rule seems to be "don't communicate too much and stay fuzzy, in order to keep the power"... But if it continues, developers will end up switching to another framework! – prossblad Nov 06 '19 at 16:33
1

The pyspark command is a REPL (read–eval–print loop) that starts an interactive shell for testing a few PySpark commands. It is used during development. We are talking about Python here.
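
For example, in the pyspark shell a SparkContext is already available as sc, so a quick test can be typed in directly; a small sketch (the path is only a placeholder):

# typed at the pyspark >>> prompt; sc already exists, so do not create another one
rdd = sc.textFile("./data/")                      # placeholder path
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))
counts.take(5)                                    # inspect a few results interactively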

To run a Spark application written in Scala or Python on a cluster or locally, you use spark-submit.

Dharman
Sharhabeel Hamdan
0

spark-submit is a utility for submitting your Spark program (or job) to a Spark cluster. If you open the spark-submit script, you will see that it eventually calls a Scala class:

org.apache.spark.deploy.SparkSubmit 

On the other hand, pyspark and spark-shell are REPL (read–eval–print loop) utilities that let the developer run Spark code as they write it and evaluate it on the fly.

Ultimately, both of them run a job behind the scenes, and most of the options are the same, as you can see by running the following commands:

spark-submit --help
pyspark --help
spark-shell --help

spark-submit has some additional options to take your Spark program (Scala or Python) as a bundle (a jar, or a zip for Python) or as individual .py or .class files (a small sketch follows the usage output below).

spark-submit --help
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
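
As noted above, for Python the bundle is typically a .zip of your modules passed with --py-files, which puts the archive on the Python path of both the driver and the executors. A small sketch, with hypothetical file and module names:

# main.py -- submitted as: spark-submit --py-files deps.zip main.py
# deps.zip and my_helpers are hypothetical; --py-files makes the archive
# importable on the driver and on every executor
from pyspark import SparkContext
import my_helpers  # a module packaged inside deps.zip

sc = SparkContext(appName="WithDeps")
result = sc.parallelize(range(10)).map(my_helpers.transform).collect()
print(result)
sc.stop()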

Both also provide a Web UI for tracking Spark job progress and other metrics.

When you kill your shell (pyspark or spark-shell) with Ctrl+C, the Spark session is killed and the Web UI can no longer show details.

If you look at spark-shell, it has one additional option, -I, to run a script line by line:

Scala REPL options:
  -I <file>                   preload <file>, enforcing line-by-line interpretation
H Roy