I create a virtual environment and run a PySpark script. If I do these steps on macOS, everything works fine. However, if I run them on Linux (Ubuntu 16), the incorrect version of Python is picked up. Of course, I previously did export PYSPARK_PYTHON=python3 on Linux, but the issue remains. Below I explain all the steps:

1. Edit the profile: vim ~/.profile

2. Add this line to the file: export PYSPARK_PYTHON=python3

3. Execute the command: source ~/.profile
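
To confirm the variable is actually picked up (assuming your shell reads ~/.profile, e.g. a bash login shell), you can check:

source ~/.profile
echo $PYSPARK_PYTHON    # should print: python3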

Then I do:

pip3 install --upgrade pip
pip3 install virtualenv
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xvzf spark-2.4.0-bin-hadoop2.7.tgz && rm spark-2.4.0-bin-hadoop2.7.tgz

virtualenv test-ve
source test-ve/bin/activate && pip install -r requirements.txt

If I execute python --version inside the virtual environment, I see Python 3.5.2.

However, when I run the Spark code with this command: sudo /usr/local/spark-2.4.0-bin-hadoop2.7/bin/spark-submit mySpark.py, I get Using Python version 2.7... from these lines of code:

print("Using Python version %s (%s, %s)" % (
    platform.python_version(),
    platform.python_build()[0],
    platform.python_build()[1]))
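
For what it's worth, printing the interpreter path as well (an extra diagnostic line, not part of the original script above) makes it obvious which binary spark-submit actually picked:

import sys
print("Driver executable: %s" % sys.executable)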

1 Answer

PYSPARK_PYTHON sets the call that's used to execute Python on the slave nodes. There's a separate environment variable called PYSPARK_DRIVER_PYTHON that sets the call for the driver node (i.e., the node on which your script is initially run). So you need to set PYSPARK_DRIVER_PYTHON=python3 too.
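
Alongside the variable you already export in ~/.profile, that would look something like:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3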

Edit

As phd points out, you may be running into trouble with your environment since you're using sudo to call spark-submit. One thing to try would be using sudo -E instead of just sudo. The -E option preserves your environment (though it isn't perfect).
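
Using the paths from your question, that would be something like:

sudo -E /usr/local/spark-2.4.0-bin-hadoop2.7/bin/spark-submit mySpark.py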

If that fails, you can try setting the spark.pyspark.driver.python and spark.pyspark.python options directly. For example, you can pass the desired values into your call to spark-submit:

sudo /usr/local/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --conf spark.pyspark.driver.python=python3 --conf spark.pyspark.python=python3 mySpark.py

There are a number of different ways to set these options (see the Spark configuration docs for full details). If one doesn't work or is inconvenient for you, try another.
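
For instance, a sketch of the spark-defaults.conf route (assuming the default conf directory under your Spark install; adjust the path if yours differs):

# /usr/local/spark-2.4.0-bin-hadoop2.7/conf/spark-defaults.conf
spark.pyspark.driver.python  python3
spark.pyspark.python         python3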

  • No, I still see `Using Python version 2.7.12 (default, Nov 12 2018 14:36:49)`. I should run the export commands inside the virtual environment, right? – Mozimaki Dec 10 '18 at 13:18
  • I added some alternative fixes to my answer that you can try – tel Dec 10 '18 at 15:32