
I want to run GraphFrames with PySpark.

I found this answer and followed its instructions, but it doesn't work.

This is my code, hello_spark.py:

import pyspark

# Bind the driver to localhost and start a local SparkContext.
conf = pyspark.SparkConf().set("spark.driver.host", "127.0.0.1")
sc = pyspark.SparkContext(master="local", appName="myAppName", conf=conf)

# Note: this ships the GraphX jar, which does not contain the
# graphframes Python module, so the import below still fails.
sc.addPyFile("/opt/spark/jars/spark-graphx_2.12-3.0.2.jar")

from graphframes import *
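
If you launch the script with plain python rather than spark-submit, one alternative (a sketch based on my reading of the Spark configuration docs, not something from the original post) is to put the package coordinate in the SparkConf itself via spark.jars.packages. It has to be set before the SparkContext, and hence the JVM, is created:

import pyspark

# Sketch: ask Spark to resolve the GraphFrames package itself.
# spark.jars.packages must be set before the JVM starts, so this
# applies when running "python hello_spark.py"; under spark-submit
# the JVM is already up, and the --packages flag is the safer route.
conf = (
    pyspark.SparkConf()
    .set("spark.driver.host", "127.0.0.1")
    .set("spark.jars.packages", "graphframes:graphframes:0.8.1-spark3.0-s_2.12")
)
sc = pyspark.SparkContext(master="local", appName="myAppName", conf=conf)

from graphframes import *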

When I run it with this command:

spark-submit hello_spark.py 

It returns this error:

from graphframes import *
ModuleNotFoundError: No module named 'graphframes'

This is my .bashrc config:

# For Spark setup
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
export SPARK_LOCAL_IP=localhost
export SPARK_OPTS="--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12"
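
One caveat: as far as I know, spark-submit does not read SPARK_OPTS (that variable is used by some notebook kernels, not by Spark's launcher scripts). The environment variable the PySpark launcher does consult is PYSPARK_SUBMIT_ARGS, which must end with pyspark-shell for the interactive shell. A hedged sketch of that alternative:

# Assumption: PYSPARK_SUBMIT_ARGS is read by the pyspark launcher;
# SPARK_OPTS is not consulted by spark-submit itself.
export PYSPARK_SUBMIT_ARGS="--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 pyspark-shell"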

My Spark version is 3.0.2 and my Scala version is 2.12.10.

I installed graphframes with this command:

pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
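
An aside that may also help: as far as I know, the GraphFrames Python wrapper is also published on PyPI, so another option is to install it into the interpreter that PYSPARK_PYTHON points to. The JVM side still needs the jar, for example via --packages:

# Hypothetical alternative: install the Python wrapper into the
# active environment; the jar must still be supplied to Spark.
pip install graphframes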

Does anyone know how to fix this? Thanks.


1 Answer


I found that it works if I use this command. Note that the --packages option must come before the script file, because spark-submit treats everything after the application file as arguments to your application:

spark-submit --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 hello_spark.py
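
If you don't want to repeat the flag on every run, another option (a sketch; the path assumes the SPARK_HOME from the question) is to set the coordinate once in spark-defaults.conf, which every spark-submit invocation picks up:

# /opt/spark/conf/spark-defaults.conf
# Sketch: resolve GraphFrames for every submitted application.
spark.jars.packages    graphframes:graphframes:0.8.1-spark3.0-s_2.12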

Also, note that you have to install some Python dependencies for PySpark, such as numpy, or you will see errors like this:

File "<frozen zipimport>", line 259, in load_module
  File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/param/__init__.py", line 26, in <module>
ModuleNotFoundError: No module named 'numpy'

So I just changed the PYSPARK_PYTHON path to point to my Miniconda environment.

export PYSPARK_PYTHON=/home/username/miniconda3/envs/pyenv/bin/python

You can find your environment's path by activating it and running the which command:

(base) username@user:~$ conda activate pyenv
(pyenv) username@user:~$ which python
/home/username/miniconda3/envs/pyenv/bin/python
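
To clear the numpy error from the traceback above, make sure the dependency is installed in that same environment (a usage sketch; the environment name pyenv is taken from this answer):

(pyenv) username@user:~$ pip install numpy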