
I'm trying to run a script in the pyspark environment but so far I haven't been able to.

How can I run a script like python script.py but in pyspark?

Ani Menon
Daniel Rodríguez

6 Answers


You can do: ./bin/spark-submit mypythonfile.py

Running python applications through pyspark is not supported as of Spark 2.0.
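With spark-submit, nothing like the shell's sc is predefined; the script has to create its own entry point. A minimal sketch of what mypythonfile.py could contain (the app name is illustrative):

from pyspark.sql import SparkSession

# spark-submit does not predefine `spark` or `sc`; build the session here.
spark = SparkSession.builder.appName('myApp').getOrCreate()

df = spark.range(10)  # small demo DataFrame
print(df.count())

spark.stop()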

Ulas Keles
  • Thanks for the answer, can you tell me how to do it in Windows? – Daniel Rodríguez Oct 13 '16 at 19:17
  • @DanielRodríguez Should be the same. The spark folder you downloaded should have a `spark-submit` file – OneCricketeer Oct 13 '16 at 19:25
  • It tells me that 'sc' is not defined, and when I run spark-submit after opening pyspark it throws an invalid syntax error – Daniel Rodríguez Oct 13 '16 at 19:49
  • It sounds like you haven't initialized an 'sc' variable with SparkContext(). Take a look at this page if you haven't already done so: https://spark.apache.org/docs/0.9.0/python-programming-guide.html. It's hard to tell what you might be doing wrong without seeing your code. – Ulas Keles Oct 13 '16 at 20:27
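As the last comment suggests, a script run via spark-submit must initialize its own context. A minimal sketch using the lower-level SparkContext API (the app name is illustrative):

from pyspark import SparkContext

# spark-submit does not inject `sc`; create it explicitly.
sc = SparkContext(appName='myApp')
print(sc.parallelize(range(10)).sum())
sc.stop()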

pyspark 2.0 and later executes the script file referenced by the PYTHONSTARTUP environment variable, so you can run:

PYTHONSTARTUP=code.py pyspark

Compared to the spark-submit answer, this is useful for running initialization code before using the interactive pyspark shell.
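For instance, code.py can prepare state for the interactive session. By the time the startup file runs, the pyspark shell has already created spark and sc, so the script can use them directly. A sketch, with an illustrative view name:

# code.py: executed before the interactive prompt appears.
# `spark` and `sc` are already defined by the pyspark shell.
df = spark.range(100)
df.createOrReplaceTempView('numbers')
print("Temp view 'numbers' registered")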

Jussi Kujala
  • I don't understand the instructions on what to do here. How do I follow them? – Reub Mar 21 '18 at 07:46
  • Works like a charm. @Dr.DOOM Just type it in your shell – dgregory Jun 11 '18 at 07:26
  • This is a misleading and wrong answer. Just because it works doesn't mean you should do it. @Ulas Keles' answer is the correct one – ciurlaro Mar 25 '21 at 17:01

Just spark-submit mypythonfile.py should be enough.

Selva

You can execute "script.py" as follows:

pyspark < script.py

or

# if you want to run pyspark in yarn cluster
pyspark --master yarn < script.py
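
Because the file is piped into the interactive shell, sc and spark are already defined when it runs. A minimal sketch of such a script.py:

# This script is fed to the pyspark shell, so `sc` and `spark`
# already exist; do not create a new context here.
rdd = sc.parallelize(range(10))
print(rdd.sum())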
Arun Annamalai

Existing answers are right (that is, use spark-submit), but some of us might want to get started with a SparkSession object as you would in the pyspark shell.

So, in the PySpark script to be run, first add:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession with Hive support, running on YARN.
spark = SparkSession.builder \
    .master('yarn') \
    .appName('pythonSpark') \
    .enableHiveSupport() \
    .getOrCreate()

Then use spark.conf.set('conf_name', 'conf_value') for runtime settings. Note that static resources such as executor cores and memory must be set before the session is created, via the builder's .config() or spark-submit flags; spark.conf.set cannot change them on a running session.
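A sketch of the distinction, with illustrative values:

from pyspark.sql import SparkSession

# Static resources: set before the session exists.
spark = SparkSession.builder \
    .master('yarn') \
    .appName('pythonSpark') \
    .config('spark.executor.memory', '4g') \
    .config('spark.executor.cores', '2') \
    .enableHiveSupport() \
    .getOrCreate()

# Runtime (SQL) settings: can be changed on a live session.
spark.conf.set('spark.sql.shuffle.partitions', '200')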

Ani Menon

The Spark environment provides a command to execute an application file, whether it is written in Scala or Java (packaged as a jar), Python, or R. The command is:

$ spark-submit --master <url> <SCRIPTNAME>.py
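For example, to run a script locally on all available cores (the file name is illustrative):

$ spark-submit --master local[*] script.py

On a YARN cluster you would pass --master yarn instead.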

I'm running Spark on a 64-bit Windows system with JDK 1.8.

