
I'm trying to load a Spark (2.2.1) package in a Jupyter notebook that can otherwise run Spark fine. Once I add

%env PYSPARK_SUBMIT_ARGS='--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'

I get this error upon trying to create a context:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-5-b25d0ed9494e> in <module>()
----> 1 sc = SparkContext.getOrCreate()
      2 sql_context = SQLContext(sc)

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in getOrCreate(cls, conf)
    332         with SparkContext._lock:
    333             if SparkContext._active_spark_context is None:
--> 334                 SparkContext(conf=conf or SparkConf())
    335             return SparkContext._active_spark_context
    336 

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    113         """
    114         self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    116         try:
    117             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    281         with SparkContext._lock:
    282             if not SparkContext._gateway:
--> 283                 SparkContext._gateway = gateway or launch_gateway(conf)
    284                 SparkContext._jvm = SparkContext._gateway.jvm
    285 

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/java_gateway.py in launch_gateway(conf)
     93                 callback_socket.close()
     94         if gateway_port is None:
---> 95             raise Exception("Java gateway process exited before sending the driver its port number")
     96 
     97         # In Windows, ensure the Java child processes do not linger after Python has exited.

Exception: Java gateway process exited before sending the driver its port number

Again, everything works fine as long as PYSPARK_SUBMIT_ARGS is not set (or is set to just pyspark-shell). As soon as I add anything else (e.g., --master local pyspark-shell), I get this error. Googling this, most people suggest simply getting rid of PYSPARK_SUBMIT_ARGS, which I can't do for obvious reasons.

I've also tried setting JAVA_HOME, although I don't see why that would make a difference, seeing as Spark works without that environment variable. The arguments I'm passing work outside Jupyter with spark-submit and pyspark.

I guess my first question is: is there any way to get a more detailed error message? Is there a log file somewhere? The current message really tells me nothing.

lfk
  • Have you tried running it in console mode, i.e. outside of a notebook? – femibyte Feb 28 '18 at 15:48
  • Yes. Same arguments work with `spark-submit` and `pyspark` (as well as `spark-shell`) – lfk Feb 28 '18 at 22:28
  • Found the problem: Jupyter is including the quotation marks in the environment variable. Removing them fixed it (see the check after these comments). – lfk Feb 28 '18 at 23:08
  • @lfk I used `%env PYSPARK_SUBMIT_ARGS=--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell` at the very beginning of my notebook, but still got the same error as I reported [here](https://stackoverflow.com/questions/49861973/). – kww Apr 17 '18 at 15:29
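
To confirm the quoting issue lfk describes above, a quick check from inside the same notebook is to print the raw value of the variable; this is a sketch, not verbatim output, but if the %env magic kept the quotes they will show up as part of the string, and spark-submit then receives an option string it cannot parse:

import os

# Inspect what %env actually stored; with the quoted form above, the
# surrounding single quotes are expected to appear as part of the value.
print(repr(os.environ.get('PYSPARK_SUBMIT_ARGS')))

# Assigning the string directly (no extra quotes) sidesteps the problem:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'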

1 Answer


Set PYSPARK_SUBMIT_ARGS as below before initializing SparkContext:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'
Nandeesh
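
For example, a minimal first notebook cell built from the snippets in the question and this answer might look like the following; it is a sketch assuming the same spark-redshift package coordinates, and it must run before anything else starts the JVM:

import os

# Set the submit args before the first SparkContext launches the gateway/JVM.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sql_context = SQLContext(sc)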