
I have a Jupyter kernel working with PySpark.

> cat kernel.json
{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
 "display_name":"PySpark"
}

I want to modify this kernel to add a connection to Cassandra. In script mode, I type:

pyspark \
    --packages anguenot:pyspark-cassandra:0.7.0 \
    --conf spark.cassandra.connection.host=12.34.56.78 \
    --conf spark.cassandra.auth.username=cassandra \
    --conf spark.cassandra.auth.password=cassandra

The script version works perfectly, but I would like to do the same in Jupyter.

Where should I put this information in my kernel? I have already tried both of the following:

{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
 "display_name":"PySpark with Cassandra",
 "spark.jars.packages": "anguenot:pyspark-cassandra:0.7.0",
 "spark.cassandra.connection.host": "12.34.56.78",
 "spark.cassandra.auth.username": "cassandra",
 "spark.cassandra.auth.password": "cassandra"
}

and

{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
 "display_name":"PySpark with Cassandra",
 "PYSPARK_SUBMIT_ARGS": "--packages anguenot:pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78 --conf spark.cassandra.auth.username=cassandra --conf spark.cassandra.auth.password=cassandra"
}

Neither of them works. When I execute:

sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="my_table", keyspace="my_keyspace")\
    .load()

I receive the error java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.


FYI: I am not creating the Spark session from within the notebook. The sc object already exists when the kernel starts.

Steven
  • @user8371915 the "possible duplicate" does not answer my problem, as I am not `creating the Spark session from within the notebook`, as mentioned in that answer ... – Steven Jun 01 '18 at 09:32
  • Packages have to be included __before the session is initialized__. Where you initialize it doesn't matter, and the same methods apply. Configuration options which are set after the session object has been created won't have any effect at all. – Alper t. Turker Jun 01 '18 at 09:35
  • @user8371915 OK, so I tried : ` spark = SparkSession.builder.appName('my_awesome')\ .config("spark.jars.packages", "anguenot:pyspark-cassandra:0.7.0")\ .config("spark.cassandra.connection.host", "12.34.56.78")\ .config("spark.cassandra.auth.username", "cassandra")\ .config("spark.cassandra.auth.password", "cassandra")\ .getOrCreate() spark.read\ .format("org.apache.spark.sql.cassandra")\ .options(table="my_table", keyspace="my_keyspace")\ .load() java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. ` – Steven Jun 01 '18 at 09:41
  • `getOrCreate` will use the existing `SparkContext`. You have to set the configuration before any Spark objects are used (to be precise, before the JVM has been started). Personally I would just use the Spark configuration and move on. Hardcoding this anywhere in the kernel seems like a bad idea. – Alper t. Turker Jun 01 '18 at 09:47
  • @user8371915 That is exactly why I need to put the parameters in the Jupyter kernel ... because Jupyter automatically initializes the SparkContext. I cannot input anything prior to that. Therefore, the answer you gave me is not what I need. – Steven Jun 01 '18 at 09:48
  • If you don't want to modify the kernel and/or the Spark configuration, then it is just not possible. Once the JVM is up, no related configuration will have an effect. – Alper t. Turker Jun 01 '18 at 09:50
  • @user8371915 I never said I don't want to. I'm just saying I don't know how. I tried and failed, and your duplicate answer did not help. – Steven Jun 01 '18 at 09:51

1 Answer


spark.jars.* options have to be configured before the SparkContext is initialized. After that has happened, the configuration has no effect. This means you have to do one of the following:

  • Modify SPARK_HOME/conf/spark-defaults.conf or SPARK_CONF_DIR/spark-defaults.conf and make sure that SPARK_HOME or SPARK_CONF_DIR is in scope when the kernel is started (see the first sketch below this list).
  • Modify the kernel initialization code (where the SparkContext is initialized) using the same methods as described in Add Jar to standalone pyspark (see the second sketch below).
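
For the first option, a minimal sketch of the properties that could be appended to spark-defaults.conf, reusing the package coordinates, host, and credentials from the question (these values are the question's placeholders):

spark.jars.packages                anguenot:pyspark-cassandra:0.7.0
spark.cassandra.connection.host    12.34.56.78
spark.cassandra.auth.username      cassandra
spark.cassandra.auth.password      cassandra

For the second option, the linked answer's approach is to export PYSPARK_SUBMIT_ARGS before the SparkContext (and therefore the JVM) is created. A hedged sketch, assuming the kernel's initialization code launches a local SparkContext through the regular pyspark machinery (whether this particular sparkmagic-based kernel honors the variable is an assumption on my part):

import os

# This must run before pyspark starts the JVM; the trailing "pyspark-shell"
# token is required when the submit arguments are supplied this way.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages anguenot:pyspark-cassandra:0.7.0 "
    "--conf spark.cassandra.connection.host=12.34.56.78 "
    "--conf spark.cassandra.auth.username=cassandra "
    "--conf spark.cassandra.auth.password=cassandra "
    "pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext()  # now started with the Cassandra connector on the classpath

If you would rather keep this in the kernel spec itself, note that environment variables belong in the kernel.json "env" dictionary (e.g. "env": {"PYSPARK_SUBMIT_ARGS": "..."}), not as a top-level key as in the second attempt in the question; whether the kernel's launcher then picks it up is again an assumption.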

I would also strongly recommend reading Configuring Spark to work with Jupyter Notebook and Anaconda.

Alper t. Turker