
I have a Spark program that runs locally on my Windows machine. I use numpy to do some calculations, but I get an exception:

ModuleNotFoundError: No module named 'numpy'

My code:

import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql.functions import udf, array
from pyspark.sql import SparkSession

spark = SparkSession\
      .builder\
      .appName("Playground")\
      .config("spark.master", "local")\
      .getOrCreate()

@udf("float")
def myfunction(x):
    y = np.array([1, 3, 9])
    x = np.array(x)
    return cosine(x, y)
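# Aside, from the comments below: cosine() returns a NumPy float64, which the
# declared "float" return type does not accept, so once the import problem is
# fixed this likely also needs `return float(cosine(x, y))`.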


df = spark\
    .createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)])\
    .withColumnRenamed("_1", "doc")\
    .withColumnRenamed("_2", "word1")\
    .withColumnRenamed("_3", "word2")\
    .withColumnRenamed("_4", "word3")


df2 = df.select("doc", array([c for c in df.columns if c != "doc"]).alias("words"))

df2 = df2.withColumn("cosine", myfunction("words"))

df2.show() # The exception is thrown here

However, if I run a different file that contains only:

import numpy as np
x = np.array([1, 3, 9])

then it works fine.
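A quick way to confirm which interpreter the Spark workers actually use is to compare sys.executable on the driver and inside a task. A minimal diagnostic sketch, independent of the program above:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("which-python").getOrCreate()

# Interpreter running this driver script
print("driver:", sys.executable)

# Interpreter running the executor tasks; if numpy is missing there,
# that is the runtime that needs it installed
print("worker:", spark.sparkContext.range(1).map(lambda _: sys.executable).first())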

Edit:

As pissall suggested in the comments, I've installed both numpy and scipy in the venv. Now if I run it with spark-submit, it fails on the first line, and if I run it with python.exe, I keep getting the same error message as before.

I run it like this:

spark-submit --master local spark\ricky2.py --conf spark.pyspark.virtualenv.requirements=requirements.txt

requirements.txt:

numpy==1.16.3
pyspark==2.3.4
scipy==1.2.1

But it fails on the first line.
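(Note: spark-submit treats everything after the application script as arguments to the script itself, so the --conf above is passed to ricky2.py rather than to Spark. With the options placed before the script, the command would look like the sketch below; whether spark.pyspark.virtualenv.requirements is honored at all depends on the Spark distribution.)

spark-submit --master local --conf spark.pyspark.virtualenv.requirements=requirements.txt spark\ricky2.py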

I get the same error for both venv and conda.

Alon
  • try to cast the return type of your function: `return float(cosine(x,y))` – jxc Oct 03 '19 at 23:28
  • @jxc his code will fail at the first line itself, don't you think? – pissall Oct 04 '19 at 04:51
  • What is the directory of the `python` executable that `pyspark` uses? Have you installed `numpy` in its `site-packages`? – pissall Oct 04 '19 at 04:53
  • @jxc it didn't help. – Alon Oct 04 '19 at 06:14
  • @pissall yes I can see numpy under external libraries/python 3.7/site-packages. However numpy doesn't exist in the site-packages under project/venv/lib/site-packages. Maybe it has something to do with this? – Alon Oct 04 '19 at 10:42
  • @pissall, I see. I copied the OP's code directly into my notebook w/o checking the detailed description; that might be another error in that code anyway. – jxc Oct 04 '19 at 13:14
  • @Alon In Unix systems, you can say `which pyspark` to know the directory. If it's in your `venv`, then do a pip install after activating the `venv` – pissall Oct 04 '19 at 13:29
  • @pissall I'm not familiar with venv. For me it's just a directory that PyCharm has created. – Alon Oct 04 '19 at 16:20
  • `source project/venv/bin/activate` will activate the virtual environment. Then just say `pip install numpy` – pissall Oct 04 '19 at 16:24
  • @pissall I've installed both numpy and scipy on the venv. Now if I try to run it with spark-submit then it falls on the first line, and if I run it using python.exe then I keep getting the same error message I had before. – Alon Oct 04 '19 at 17:25
  • Maybe this could help https://stackoverflow.com/questions/29449271/no-module-named-numpy-when-spark-submitting#30333076 – Grzegorz Skibinski Oct 05 '19 at 23:03
  • @Alon, you talked about PyCharm. You can look at the installed libraries in 'File > Project Structure > SDKs > Packages'. Make sure that numpy is listed there. When you run your code (with spark-submit or python.exe), do you run it from PyCharm or manually from the console? If it is from the console, is your venv activated? If from PyCharm, what is the Run Configuration (in particular the Python interpreter)? – AlexisBRENON Oct 09 '19 at 13:23
  • @AlexisBRENON I don't have a Project Structure option under File. I can see numpy under File->Settings->Project Interpreter. Anyway, I know for sure that numpy works. It just doesn't work together with Spark. – Alon Oct 09 '19 at 16:55
  • @AlexisBRENON anyway I run it now using the venv. Yes I have activated it. – Alon Oct 09 '19 at 16:56
  • Possible duplicate of [spark-submit with specific python librairies](https://stackoverflow.com/questions/48644166/spark-submit-with-specific-python-librairies) – jslipknot Oct 14 '19 at 07:06

1 Answer


It looks like numpy is installed for a different Python runtime than the one Spark uses. You can tell Spark which runtime to use by setting the environment variable PYSPARK_PYTHON.

Set it in Spark's configuration file, conf/spark-env.sh, inside Spark's installation directory. The distribution only ships a template, spark-env.sh.template (spark-env.cmd.template on Windows, I believe), which must first be renamed to spark-env.sh (spark-env.cmd):

PYSPARK_PYTHON=<path to your python runtime/executable>

You can read more about environment variables in the docs.
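For a local run you can also set it from the driver script itself, before the SparkSession is created. A minimal sketch; the venv path is a placeholder, not taken from the question, so adjust it to your machine:

import os
import sys

# Workers must use the interpreter of the venv where numpy and scipy are
# installed; on Windows a venv's interpreter sits under Scripts\.
os.environ["PYSPARK_PYTHON"] = r"C:\path\to\project\venv\Scripts\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable  # keep the driver unchanged

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Playground") \
    .config("spark.master", "local") \
    .getOrCreate()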

0xc0de
  • Regarding PYSPARK_PYTHON, I've tried it and it didn't work, but it uses the correct SPARK_HOME location anyway. Maybe I should install numpy inside SPARK_HOME somehow? – Alon Oct 10 '19 at 11:18
  • Regarding spark-env.cmd.template, I didn't understand what you wanted me to do with this. The whole file is commented out. – Alon Oct 10 '19 at 11:19