I create a virtual environment and run a PySpark script in it. If I do these steps on macOS, everything works fine. However, if I run them on Linux (Ubuntu 16), the wrong Python version is picked up. Of course, I previously ran export PYSPARK_PYTHON=python3
on Linux, but the issue remains. Below I explain all the steps:
1. Edit the profile: vim ~/.profile
2. Add this line to the file: export PYSPARK_PYTHON=python3
3. Execute the command: source ~/.profile (a quick check follows the list)
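To confirm the variable is really set in the current shell, I check it like this (the expected output is simply what a correctly sourced profile should print):
echo $PYSPARK_PYTHON
# should print: python3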
Then I do:
pip3 install --upgrade pip
pip3 install virtualenv
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xvzf spark-2.4.0-bin-hadoop2.7.tgz && rm spark-2.4.0-bin-hadoop2.7.tgz
virtualenv test-ve
source test-ve/bin/activate && pip install -r requirements.txt
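Inside the activated environment I can also check which interpreter is resolved (the path below is only illustrative, assuming the environment was created in the current directory):
which python
# e.g. /home/<user>/test-ve/bin/python, i.e. the virtualenv's interpreter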
If I execute python --version
inside the virtual environment, I see Python 3.5.2.
However, when I run the Spark code with this command: sudo /usr/local/spark-2.4.0-bin-hadoop2.7/bin/spark-submit mySpark.py,
I get Using Python version 2.7...
from these lines of code:
print("Using Python version %s (%s, %s)" % (
platform.python_version(),
platform.python_build()[0],
platform.python_build()[1]))
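For reference, mySpark.py can be reduced to this minimal, self-contained sketch (the SparkSession boilerplate here is illustrative; only the version print is taken verbatim from my script):

import platform

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession so the file can be run via spark-submit.
spark = SparkSession.builder.appName("python-version-check").getOrCreate()

# Report which Python interpreter the driver is actually using.
print("Using Python version %s (%s, %s)" % (
    platform.python_version(),
    platform.python_build()[0],
    platform.python_build()[1]))

spark.stop()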