I'm trying to load a Spark (2.2.1) package in a Jupyter notebook that can otherwise run Spark fine. Once I add
%env PYSPARK_SUBMIT_ARGS='--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'
I get this error upon trying to create a context:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-5-b25d0ed9494e> in <module>()
----> 1 sc = SparkContext.getOrCreate()
2 sql_context = SQLContext(sc)
/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in getOrCreate(cls, conf)
332 with SparkContext._lock:
333 if SparkContext._active_spark_context is None:
--> 334 SparkContext(conf=conf or SparkConf())
335 return SparkContext._active_spark_context
336
/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
113 """
114 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
116 try:
117 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
281 with SparkContext._lock:
282 if not SparkContext._gateway:
--> 283 SparkContext._gateway = gateway or launch_gateway(conf)
284 SparkContext._jvm = SparkContext._gateway.jvm
285
/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/java_gateway.py in launch_gateway(conf)
93 callback_socket.close()
94 if gateway_port is None:
---> 95 raise Exception("Java gateway process exited before sending the driver its port number")
96
97 # In Windows, ensure the Java child processes do not linger after Python has exited.
Exception: Java gateway process exited before sending the driver its port number
Again, everything works fine as long as PYSPARK_SUBMIT_ARGS is not set (or is set to just pyspark-shell). As soon as I add anything else (e.g., if I set it to --master local pyspark-shell), I get this error. Most of what I find from googling suggests simply getting rid of PYSPARK_SUBMIT_ARGS, which I can't do for obvious reasons.
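To make the failing cell concrete, this is roughly what I'm running (sketched here with os.environ purely for illustration; in the notebook itself I use the %env magic shown above):

import os
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Illustration only: in the notebook I set this via the %env magic above.
# Even a minimal value like '--master local pyspark-shell' fails the same way for me.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'

sc = SparkContext.getOrCreate()   # raises the "Java gateway process exited..." exception
sql_context = SQLContext(sc)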
I've tried setting my JAVA_HOME as well, although I don't see why that would make a difference, since Spark works without that environment variable. The arguments I'm passing work fine outside Jupyter with both spark-submit and pyspark.
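For reference, the kind of invocation that works for me outside Jupyter looks roughly like this (same package coordinates as above):

pyspark --packages com.databricks:spark-redshift_2.10:2.0.1

and passing the same --packages argument to spark-submit also works.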
I guess my first question is: is there any way to get a more detailed error message? Is there a log file somewhere? The current message really tells me nothing.