Set up Jupyter on EMR to read from Cassandra using CQL?

Question

When I try to configure the Spark context in Jupyter with either

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages datastax:spark-cassandra-connector:2.4.0-s_2.11 --conf spark.cassandra.connection.host=x.x.x.x pyspark-shell'

or

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName('SparkCassandraApp') \
  .config('spark.cassandra.connection.host', 'x.x.x.x') \
  .config('spark.cassandra.connection.port', 'xxxx') \
  .config('spark.cassandra.output.consistency.level','ONE') \
  .master('local[2]') \
  .getOrCreate()
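
As an aside on the builder variant: it sets the connection options but never names the connector package, so on its own it would not be expected to make `org.apache.spark.sql.cassandra` available. A minimal sketch of one way to include it is below. The helper names `cassandra_session_conf` and `build_session` are hypothetical, the package coordinate is simply copied from the `PYSPARK_SUBMIT_ARGS` line, the port 9042 is Cassandra's default and stands in for the elided `xxxx`, and the whole thing assumes no Spark session is already running in the kernel (`spark.jars.packages` is only read when the JVM starts).

```python
# Connector coordinate copied from the PYSPARK_SUBMIT_ARGS line in the question.
CONNECTOR_PACKAGE = "datastax:spark-cassandra-connector:2.4.0-s_2.11"


def cassandra_session_conf(host, port, package=CONNECTOR_PACKAGE):
    """Return Spark settings that name both the connector package and the cluster."""
    return {
        # Asks Spark to resolve the connector jar when the session starts.
        "spark.jars.packages": package,
        "spark.cassandra.connection.host": host,
        "spark.cassandra.connection.port": str(port),
    }


def build_session(conf, app_name="SparkCassandraApp"):
    """Create a session from the settings (hypothetical helper, not called here)."""
    # Imported lazily so the settings helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Assumption: 9042 is Cassandra's default native transport port.
conf = cassandra_session_conf("x.x.x.x", 9042)
```

Calling `build_session(conf)` in a fresh kernel would then be the step that actually starts Spark; if a session already exists, `getOrCreate()` returns it and the package setting is likely to be ignored.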

I still cannot connect to the Cassandra cluster with this code:

dataFrame = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "keyspace").option("table", "table").load()
dataFrame = dataFrame.limit(100)
dataFrame.show()

It fails with this error:

An error was encountered:
An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. 
Please find packages at http://spark.apache.org/third-party-projects.html
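
The "An error was encountered:" banner resembles what sparkmagic-based kernels print, which suggests (as an assumption, not something the question confirms) that the notebook talks to Spark through Livy. In that setup the session is created remotely, so environment variables or builder options set inside a cell may never reach it; sparkmagic's `%%configure -f` cell magic is the usual place for session settings. The sketch below only generates the JSON body for such a cell. The function name `livy_configure_body` is hypothetical and the package coordinate is copied from the question.

```python
import json

# Connector coordinate copied from the PYSPARK_SUBMIT_ARGS line in the question.
CONNECTOR_PACKAGE = "datastax:spark-cassandra-connector:2.4.0-s_2.11"


def livy_configure_body(host, package=CONNECTOR_PACKAGE):
    """Build the JSON body for a sparkmagic `%%configure -f` cell."""
    settings = {
        "conf": {
            # Resolved by Spark when the Livy session is (re)created.
            "spark.jars.packages": package,
            "spark.cassandra.connection.host": host,
        }
    }
    return json.dumps(settings, indent=2)


body = livy_configure_body("x.x.x.x")

# Paste the printed text into the first cell of the notebook.
print("%%configure -f")
print(body)
```

The `-f` flag tells sparkmagic to drop any existing session and start a new one with these settings, so the cell would need to run before the `spark.read` call.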

A similar question, "modify jupyter kernel to add cassandra connection in spark", was asked here, but I do not see a valid answer to it.

What is the Spark version in EMR? – Alex Ott, Apr 10 '21 at 12:30
2.4.x! I am using Spark on EMR 5.32, @AlexOtt – Ben Reber, Apr 12 '21 at 14:18
