I went through a long, painful road to find a solution that works here.
I am working with the native Jupyter server within VS Code. In there, I created a .env
file:
SPARK_HOME=/home/adam/projects/graph-algorithms-book/spark-3.2.0-bin-hadoop3.2
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
PYSPARK_SUBMIT_ARGS="--driver-memory 2g --executor-memory 6g --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 pyspark-shell"
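VS Code's Python extension loads this .env file into the notebook kernel's environment (the python.envFile setting defaults to ${workspaceFolder}/.env); restart the kernel after editing it. A quick sanity check that the variables were actually picked up, using a hypothetical helper name:

```python
import os

# Hypothetical helper: returns the names of any required Spark
# variables missing from the environment.
def missing_spark_vars(environ=os.environ):
    required = ["SPARK_HOME", "JAVA_HOME", "PYSPARK_SUBMIT_ARGS"]
    return [name for name in required if not environ.get(name)]

# An empty list means the .env file was loaded correctly.
print(missing_spark_vars())
```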
Then in my Python notebook I have something that looks like the following:
from pyspark.sql.types import *
from graphframes import *
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName('GraphFrames').getOrCreate()
When the session starts, you should see output showing Spark fetching the dependencies. Something like this:
:: loading settings :: url = jar:file:/home/adam/projects/graph-algorithms-book/spark-3.2.0-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/adam/.ivy2/cache
The jars for the packages stored in: /home/adam/.ivy2/jars
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-96a3a1f1-4ea4-4433-856b-042d0269ec1a;1.0
confs: [default]
found graphframes#graphframes;0.8.2-spark3.2-s_2.12 in spark-packages
found org.slf4j#slf4j-api;1.7.16 in central
:: resolution report :: resolve 174ms :: artifacts dl 8ms
:: modules in use:
graphframes#graphframes;0.8.2-spark3.2-s_2.12 from spark-packages in [default]
org.slf4j#slf4j-api;1.7.16 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
After that I was able to create the vertices DataFrame for the graph:
v = spark.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
], ["id", "name", "age"])
It should work fine. Just remember to align all your PySpark versions. I had to install the proper version of graphframes
from a forked repo: the PyPI package is behind on versions, so I used the PHPirates
repo, where graphframes has been compiled against PySpark 3.2.0.
pip install "git+https://github.com/PHPirates/graphframes.git@add-setup.py#egg=graphframes&subdirectory=python"
pip install pyspark==3.2.0