
I'm on spark-1.4.1. How can I set the system environment variables for Python?

For instance, in R,

Sys.setenv(SPARK_HOME = "C:/Apache/spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

What about in Python?

import os
import sys

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="PythonSQL")
sqlContext = SQLContext(sc)

# Set the system environment variables.
# ref: https://github.com/apache/spark/blob/master/examples/src/main/python/sql.py
if len(sys.argv) < 2:
    path = "file://" + \
        os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")
else:
    path = sys.argv[1]

# Create the DataFrame
df = sqlContext.jsonFile(path)

# Show the content of the DataFrame
df.show()

I get this error,

df is not defined.


Any ideas?

Run
  • Are you just asking how to set an environment variable in Python code? http://stackoverflow.com/questions/5971312/how-to-set-environment-variables-in-python – mattinbits Aug 04 '15 at 09:51
  • here - `df = sqlContext.jsonFile(path)` – Run Aug 04 '15 at 10:00

1 Answer


Try it the way the Spark SQL programming guide does: https://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes

That is, set path = "examples/src/main/resources/people.json" and pass it to df = sqlContext.jsonFile(path).

If you run your Python script without arguments, it takes the if len(sys.argv) < 2: branch, which requires SPARK_HOME to be defined as a system environment variable. If it is not defined, the script cannot find the .json file you specified, which seems to be your problem.
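To set the variable from Python itself, the closest counterpart to R's Sys.setenv that I know of is assigning to os.environ. Below is a minimal sketch, not the Spark example's own code: `resolve_people_json` and `DEFAULT_SPARK_HOME` are hypothetical names, and the install location is an assumption taken from the R snippet in the question. The "file://" prefix from the original script is left out here to keep the sketch focused on the environment variable.

```python
import os

# Assumption: Spark is unpacked here (value copied from the R snippet in the question).
DEFAULT_SPARK_HOME = "C:/Apache/spark-1.4.1"


def resolve_people_json(argv, default_home=DEFAULT_SPARK_HOME):
    """Return the path given on the command line, or the example file under SPARK_HOME."""
    if len(argv) >= 2:
        return argv[1]
    # setdefault keeps an existing SPARK_HOME and only fills it in when it is missing,
    # so os.environ['SPARK_HOME'] can no longer raise a KeyError further down.
    spark_home = os.environ.setdefault("SPARK_HOME", default_home)
    return os.path.join(spark_home, "examples", "src", "main", "resources", "people.json")


# Simulate running the script without arguments.
path = resolve_people_json(["script.py"])
print(path)
```

A value assigned through os.environ is visible only to the current Python process and to processes it starts afterwards; it does not change the system-wide setting.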

Pär Eriksson
  • Have you tried that on your machine (Windows)? It does not work on mine. – Run Aug 04 '15 at 13:12
  • @teelou could you show how you are running your Python script? Are you passing valid arguments, since you are using sys.argv? Have you defined the location of SPARK_HOME? – Pär Eriksson Aug 04 '15 at 14:04
  • @teelou try running your script from the Spark root folder without the if statement, and define path as I did above. I'm guessing your path variable ends up invalid because you probably have not declared SPARK_HOME as a system variable. – Pär Eriksson Aug 04 '15 at 14:16
  • Another fix would be to define the absolute path (if you run the shell from another location). This works for me on Mac: path="/examples/src/main/resources/people.json" – Pär Eriksson Aug 04 '15 at 14:32
  • `Have you defined location of SPARK_HOME? ` - how do I set SPARK_HOME with Python? I am new to this language. – Run Aug 05 '15 at 01:11
  • You don't necessarily need to do it via Python; if you do, I think you need to run with administrator rights. You can do it manually as described here: https://www.java.com/en/download/help/path.xml. From Python you could use `os.system('your_command')`, where the command would be `set SPARK_HOME=C:\spark_root_folder`; you can read more here: https://docs.python.org/2/using/windows.html#excursus-setting-environment-variables – Pär Eriksson Aug 05 '15 at 07:58
  • But if you only want to check that your code works, I would recommend removing your if statement and running `bin/pyspark` (if it's called pyspark on Windows) from a console in your **spark root folder**; it should then find the path I defined in my answer. – Pär Eriksson Aug 05 '15 at 08:05
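
On the `os.system('set SPARK_HOME=...')` suggestion in the comments: as far as I know, a variable set inside a child process or child shell does not propagate back to the Python process that started it, so it may be worth checking before relying on it. The sketch below illustrates the difference with a child Python process standing in for the child shell; `DEMO_SPARK_HOME` is a hypothetical variable name used only for this demonstration, and the folder value is the placeholder from the comment above.

```python
import os
import subprocess
import sys

NAME = "DEMO_SPARK_HOME"  # hypothetical name, chosen to avoid touching a real SPARK_HOME
os.environ.pop(NAME, None)

# Set the variable inside a child process, much as a `set ...` command
# run through os.system would do inside a child shell.
child_code = "import os; os.environ['%s'] = 'C:/spark_root_folder'" % NAME
subprocess.check_call([sys.executable, "-c", child_code])
seen_after_child = os.environ.get(NAME)  # the parent process does not see the change

# Set it in the current process instead; processes started afterwards inherit it.
os.environ[NAME] = "C:/spark_root_folder"
child_output = subprocess.check_output(
    [sys.executable, "-c", "import os; print(os.environ['%s'])" % NAME]
).decode().strip()

print(seen_after_child)
print(child_output)
```

If this holds on your machine as well, assigning to os.environ before creating the SparkContext is probably the simpler route than shelling out.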