I have a Jupyter Kernel working with PySpark.
> cat kernel.json
{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
"display_name":"PySpark"
}
I want to modify this kernel to add a connection to Cassandra. In script mode, I type:
pyspark \
--packages anguenot:pyspark-cassandra:0.7.0 \
--conf spark.cassandra.connection.host=12.34.56.78 \
--conf spark.cassandra.auth.username=cassandra \
--conf spark.cassandra.auth.password=cassandra
The script version works perfectly, but I would like to do the same in Jupyter.
Where should I put this information in my kernel? I have already tried both:
{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
"display_name":"PySpark with Cassandra",
"spark.jars.packages": "anguenot:pyspark-cassandra:0.7.0",
"spark.cassandra.connection.host": "12.34.56.78",
"spark.cassandra.auth.username": "cassandra",
"spark.cassandra.auth.password": "cassandra"
}
and
{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
"display_name":"PySpark with Cassandra",
"PYSPARK_SUBMIT_ARGS": "--packages anguenot:pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78 --conf spark.cassandra.auth.username=cassandra --conf spark.cassandra.auth.password=cassandra"
}
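For completeness, the kernel spec format does support an "env" key for environment variables, so I suspect PYSPARK_SUBMIT_ARGS would belong there rather than at the top level. A sketch of that variant (untested on my side; I have also read that the value must end with pyspark-shell for PySpark to pick it up):

```json
{
  "argv": ["python", "-m", "sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
  "display_name": "PySpark with Cassandra",
  "env": {
    "PYSPARK_SUBMIT_ARGS": "--packages anguenot:pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78 --conf spark.cassandra.auth.username=cassandra --conf spark.cassandra.auth.password=cassandra pyspark-shell"
  }
}
```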
Neither of them works. When I execute:
sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="my_table", keyspace="my_keyspace")\
.load()
I receive the error: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
FYI: I am not creating the Spark session from within the notebook; the sc
object already exists when the kernel starts.
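For reference, the exact string I am trying to get into PYSPARK_SUBMIT_ARGS can be assembled like this (a hypothetical helper, not part of my kernel, just to show the pieces; as far as I know PySpark requires the trailing pyspark-shell token when the options are passed through this variable):

```python
# Hypothetical sketch: build the PYSPARK_SUBMIT_ARGS value from the same
# options used on the command line. The trailing "pyspark-shell" tells
# PySpark which class to launch when args come via this variable.
packages = "anguenot:pyspark-cassandra:0.7.0"
conf = {
    "spark.cassandra.connection.host": "12.34.56.78",
    "spark.cassandra.auth.username": "cassandra",
    "spark.cassandra.auth.password": "cassandra",
}
conf_flags = " ".join("--conf %s=%s" % kv for kv in sorted(conf.items()))
submit_args = "--packages %s %s pyspark-shell" % (packages, conf_flags)
print(submit_args)
```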