I have a Spark program that runs locally on my Windows machine. I use numpy to do some calculations, but I get an exception:
ModuleNotFoundError: No module named 'numpy'
My code:
import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql.functions import udf, array
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Playground") \
    .config("spark.master", "local") \
    .getOrCreate()
@udf("float")
def myfunction(x):
y=np.array([1,3,9])
x=np.array(x)
return cosine(x,y)
df = spark \
    .createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)]) \
    .withColumnRenamed("_1", "doc") \
    .withColumnRenamed("_2", "word1") \
    .withColumnRenamed("_3", "word2") \
    .withColumnRenamed("_4", "word3")

df2 = df.select("doc", array([c for c in df.columns if c not in {'doc'}]).alias("words"))
df2 = df2.withColumn("cosine", myfunction("words"))
df2.show()  # The exception is thrown here
However, if I run a different file that contains only:
import numpy as np
x = np.array([1, 3, 9])
then it works fine.
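For what it's worth, I understand the driver and the executors can end up using different Python interpreters, so here is a small check I can add (just a sketch; which_python is a throwaway name) to see which interpreter the UDF actually runs under:

import sys
from pyspark.sql.functions import udf

print("driver python:", sys.executable)  # interpreter running the driver

# Throwaway UDF that reports the interpreter the executor workers use
which_python = udf(lambda _: sys.executable, "string")
df.select(which_python("doc").alias("executor_python")).show(truncate=False)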
Edit:
As pissall suggested in a comment, I've installed both numpy and scipy in the venv. Now if I run it with spark-submit it fails on the first line (the numpy import), and if I run it with python.exe I keep getting the same error message as before.
I run it like this:
spark-submit --master local spark\ricky2.py --conf spark.pyspark.virtualenv.requirements=requirements.txt
requirements.txt:
numpy==1.16.3
pyspark==2.3.4
scipy==1.2.1
But it still fails on the first line.
I get the same error with both venv and conda.
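For reference, this is the kind of setup I understand should make the workers pick up the venv's interpreter, in case the problem is that the executors launch a different python.exe (a sketch with a placeholder path; adjust to wherever the venv actually lives):

import os

# Placeholder path -- replace with the actual venv location
venv_python = r"C:\path\to\venv\Scripts\python.exe"

# Point both the driver and the executors at the venv's interpreter;
# these have to be set before the SparkSession is created.
os.environ["PYSPARK_PYTHON"] = venv_python
os.environ["PYSPARK_DRIVER_PYTHON"] = venv_python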