
I have a Spark program that runs locally on my Windows machine. I use numpy to do some calculations, but I get an exception:

ModuleNotFoundError: No module named 'numpy'

My code:

import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql.functions import udf, array
from pyspark.sql import SparkSession

spark = SparkSession\
      .builder\
      .appName("Playground")\
      .config("spark.master", "local")\
      .getOrCreate()

@udf("float")
def myfunction(x):
    y = np.array([1, 3, 9])
    x = np.array(x)
    return cosine(x, y)
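# Aside, from the comments below: cosine() returns a NumPy float64, which the
# declared "float" return type does not accept, so once the import problem is
# fixed this likely also needs `return float(cosine(x, y))`.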


df = spark\
    .createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)])\
    .withColumnRenamed("_1", "doc")\
    .withColumnRenamed("_2", "word1")\
    .withColumnRenamed("_3", "word2")\
    .withColumnRenamed("_4", "word3")


df2 = df.select("doc", array([c for c in df.columns if c != "doc"]).alias("words"))

df2 = df2.withColumn("cosine", myfunction("words"))

df2.show() # The exception is thrown here

However, if I run a different file that contains only:

import numpy as np
x = np.array([1, 3, 9])

then it works fine.
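A quick way to confirm which interpreter the Spark workers actually use is to compare sys.executable on the driver and inside a task. A minimal diagnostic sketch, independent of the program above:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("which-python").getOrCreate()

# Interpreter running this driver script
print("driver:", sys.executable)

# Interpreter running the executor tasks; if numpy is missing there,
# that is the runtime that needs it installed
print("worker:", spark.sparkContext.range(1).map(lambda _: sys.executable).first())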

Edit:

As pissall suggested in the comments, I've installed both numpy and scipy in the venv. Now if I run it with spark-submit, it fails on the first line, and if I run it with python.exe, I keep getting the same error message as before.

I run it like this:

spark-submit --master local spark\ricky2.py --conf spark.pyspark.virtualenv.requirements=requirements.txt

requirements.txt:

numpy==1.16.3
pyspark==2.3.4
scipy==1.2.1

But it fails on the first line.
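(Note: spark-submit treats everything after the application script as arguments to the script itself, so the --conf above is passed to ricky2.py rather than to Spark. With the options placed before the script, the command would look like the sketch below; whether spark.pyspark.virtualenv.requirements is honored at all depends on the Spark distribution.)

spark-submit --master local --conf spark.pyspark.virtualenv.requirements=requirements.txt spark\ricky2.py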

I get the same error for both venv and conda.

Alon
  • try to cast the return type of your function: `return float(cosine(x,y))` – jxc Oct 03 '19 at 23:28
  • @jxc his code will fail at the first line itself, don't you think? – pissall Oct 04 '19 at 04:51
  • What is the directory of the `python` executable that `pyspark` uses? Have you installed `numpy` in its `site-packages`? – pissall Oct 04 '19 at 04:53
  • @jxc it didn't help. – Alon Oct 04 '19 at 06:14
  • @pissall yes I can see numpy under external libraries/python 3.7/site-packages. However numpy doesn't exist in the site-packages under project/venv/lib/site-packages. Maybe it has something to do with this? – Alon Oct 04 '19 at 10:42
  • @pissall, I see. I copied the OP's code directly into my notebook w/o checking the detailed description; that might be another error in that code anyway. – jxc Oct 04 '19 at 13:14
  • @Alon In Unix systems, you can say `which pyspark` to know the directory. If it's in your `venv`, then do a pip install after activating the `venv` – pissall Oct 04 '19 at 13:29
  • @pissall I'm not familiar with venv. For me it's just a directory that PyCharm has created. – Alon Oct 04 '19 at 16:20
  • `source project/venv/bin/activate` will activate the virtual environment. Then just say `pip install numpy` – pissall Oct 04 '19 at 16:24
  • @pissall I've installed both numpy and scipy on the venv. Now if I try to run it with spark-submit then it falls on the first line, and if I run it using python.exe then I keep getting the same error message I had before. – Alon Oct 04 '19 at 17:25
  • Maybe this could help https://stackoverflow.com/questions/29449271/no-module-named-numpy-when-spark-submitting#30333076 – Grzegorz Skibinski Oct 05 '19 at 23:03
  • @Alon, you talked about PyCharm. You can look at the installed libraries in 'File > Project Structure > SDKs > Packages'. Make sure that numpy is listed there. When you run your code (with spark-submit or python.exe), do you run it from PyCharm or manually from the console? If it is from the console, is your venv activated? If from PyCharm, what is the Run Configuration (in particular the Python interpreter)? – AlexisBRENON Oct 09 '19 at 13:23
  • @AlexisBRENON I don't have a Project Structure option under File. I can see numpy under File->Settings->Project Interpreter. Anyway, I know for sure that numpy works. It just doesn't work together with Spark. – Alon Oct 09 '19 at 16:55
  • @AlexisBRENON anyway I run it now using the venv. Yes I have activated it. – Alon Oct 09 '19 at 16:56
  • Possible duplicate of [spark-submit with specific python librairies](https://stackoverflow.com/questions/48644166/spark-submit-with-specific-python-librairies) – jslipknot Oct 14 '19 at 07:06

1 Answer


It looks like numpy is installed for a different Python runtime than the one Spark uses. You can tell Spark which runtime to use by setting the environment variable PYSPARK_PYTHON.

Set it in Spark's configuration file, conf/spark-env.sh, inside Spark's installation directory. The distribution only ships a template, spark-env.sh.template (spark-env.cmd.template on Windows, I believe), which must first be renamed to spark-env.sh (spark-env.cmd):

PYSPARK_PYTHON=<path to your python runtime/executable>

You can read more about environment variables in the docs.
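For a local run you can also set it from the driver script itself, before the SparkSession is created. A minimal sketch; the venv path is a placeholder, not taken from the question, so adjust it to your machine:

import os
import sys

# Workers must use the interpreter of the venv where numpy and scipy are
# installed; on Windows a venv's interpreter sits under Scripts\.
os.environ["PYSPARK_PYTHON"] = r"C:\path\to\project\venv\Scripts\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable  # keep the driver unchanged

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Playground") \
    .config("spark.master", "local") \
    .getOrCreate()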

0xc0de
  • Regarding PYSPARK_PYTHON, I've tried it and it didn't work, but it uses the correct SPARK_HOME location anyway. Maybe I should install numpy inside SPARK_HOME somehow? – Alon Oct 10 '19 at 11:18
  • Regarding spark-env.cmd.template, I didn't understand what you wanted me to do with this. The whole file is commented out. – Alon Oct 10 '19 at 11:19