
I am trying to run a linear regression in Spark using Python 3.5 instead of Python 2.7. So first I exported PYSPARK_PYTHON=python3. I received an error "No module named numpy". I tried "pip install numpy", but pip doesn't respect the PYSPARK_PYTHON setting and installs into Python 2.7. How do I ask pip to install numpy for 3.5? Thank you ...

$ export PYSPARK_PYTHON=python3

$ spark-submit linreg.py
....
Traceback (most recent call last):
  File "/home/yoda/Code/idenlink-examples/test22-spark-linreg/linreg.py", line 115, in <module>
from pyspark.ml.linalg import Vectors
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 21, in <module>
  File "/home/yoda/install/spark/python/lib/pyspark.zip/pyspark/ml/param/__init__.py", line 26, in <module>
ImportError: No module named 'numpy'

$ pip install numpy
Requirement already satisfied: numpy in /home/yoda/.local/lib/python2.7/site-packages

$ pyspark
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
17/02/09 20:29:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/09 20:29:20 WARN Utils: Your hostname, yoda-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
17/02/09 20:29:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/02/09 20:29:31 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.5.2 (default, Nov 17 2016 17:05:23)
SparkSession available as 'spark'.
>>> import site; site.getsitepackages()
['/usr/local/lib/python3.5/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3.5/dist-packages']
>>> 
Joshua G
  • Hint: Spark can (and usually does) do its work on a *cluster* of computers. –  Feb 10 '17 at 19:30
  • You will have to install the numpy lib on every computer in the cluster you use. If you are only running it on your local machine, then just download and install the lib properly. Spark shouldn't care whether it's numpy or any other lib, as long as it's linked properly. – Tony Tannous Feb 10 '17 at 19:30
  • @JackManey It looks like local mode; the OP is just using the wrong pip :) Joshua - using virtualenv, Anaconda, or another env management tool is a good idea (see the sketch below). – zero323 Feb 10 '17 at 19:38
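A minimal sketch of the "right pip" fix, assuming pip is available for the Python 3 interpreter; the python3 -m pip form guarantees the install targets Python 3.5's site-packages rather than 2.7's:

$ python3 -m pip install --user numpy                     # installs numpy for Python 3.5, not 2.7
$ python3 -c "import numpy; print(numpy.__version__)"     # sanity check with the same interpreter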

2 Answers


So I don't actually see this as a Spark question at all; it looks like you need help with environments. As the commenters mentioned, you need to set up a Python 3 environment and activate it. Take a look at this for a little help on working with environments. Once the environment is active, run pip install numpy or conda install numpy and you should be good to go.
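For example, a minimal sketch using virtualenv (the environment name and paths are placeholders; with Anaconda, conda create -n pyspark-env python=3.5 followed by conda install numpy would be the equivalent):

$ python3 -m venv ~/pyspark-env              # create a Python 3 environment
$ source ~/pyspark-env/bin/activate          # activate it so pip targets this interpreter
$ pip install numpy                          # numpy lands in the env's site-packages
$ export PYSPARK_PYTHON=$(which python3)     # point Spark at the env's interpreter
$ spark-submit linreg.py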

Grr

If you are running the job locally, you just need to upgrade pyspark.

Homebrew: brew upgrade pyspark; this should resolve most of the dependencies.
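If pyspark was installed with pip rather than Homebrew (an assumption, not stated in this answer), the equivalent upgrade would be:

$ pip3 install --upgrade pyspark             # upgrade the pip-managed pyspark for Python 3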

oshaiken