I created a Docker image with Spark 3.0.0 that is to be used for executing PySpark from a Jupyter notebook. The issue I'm having is that when I run the image locally and test it with the following script:
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
print("*** START ***")
sparkConf = SparkConf()
sc = SparkContext(conf=sparkConf)
rdd = sc.parallelize(range(100000000))
print(rdd.sum())
print("*** DONE ***")
I get the following error:
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    sc = SparkContext(conf=sparkConf)
  File "/usr/local/lib/python3.7/dist-packages/pyspark/context.py", line 136, in __init__
    conf, jsc, profiler_cls)
  File "/usr/local/lib/python3.7/dist-packages/pyspark/context.py", line 213, in _do_init
    self._encryption_enabled = self._jvm.PythonUtils.getEncryptionEnabled(self._jsc)
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1487, in __getattr__
    "{0}.{1} does not exist in the JVM".format(self._fqn, name))
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
I've tried using findspark and doing a fresh pip install of py4j on the image, but nothing works, and every answer I can find only suggests findspark. Has anyone else been able to solve this issue with Spark 3.0.0?
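For reference, this is roughly how I wired in findspark before creating the context (a sketch; the /opt/spark path is just an assumption about where Spark lives in my image, adjust to your own layout):

import findspark
findspark.init("/opt/spark")  # assumed SPARK_HOME inside the image; not the actual path in my Dockerfile

from pyspark import SparkConf, SparkContext

sparkConf = SparkConf()
sc = SparkContext(conf=sparkConf)  # still raises the same Py4JError for me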