We have been using the Spark RDD API (Spark 2.0) to work with data modeled in Cassandra. Note that the data is modeled in Cassandra for efficient reads and writes.
However, there is now also the Spark SQL / Spark DataFrame API as an alternate data access method - http://spark.apache.org/docs/latest/sql-programming-guide.html
With the Spark RDD API, we were using CQL through the DataStax Cassandra driver APIs to access the Cassandra DB - http://docs.datastax.com/en/developer/java-driver/2.0/ - something like this:
import java.util
import com.datastax.driver.core.{ResultSet, Row}
import com.datastax.driver.core.querybuilder.QueryBuilder
import com.datastax.spark.connector.cql.CassandraConnector
import scala.collection.JavaConverters._

val resultSets = new util.ArrayList[Row]()
val resultSet = CassandraConnector(SparkReader.conf).withSessionDo[ResultSet] { session =>
  // Build a CQL SELECT restricted by the partition/clustering key columns
  val sel_stmt = QueryBuilder.select("yyy", "zz", "xxxx")
    .from("geokpi_keyspace", table_name)
    .where(QueryBuilder.eq("bin", bin))
    .and(QueryBuilder.eq("year", year))
    .and(QueryBuilder.eq("month", month))
    .and(QueryBuilder.eq("day", day))
    .and(QueryBuilder.eq("cell", cell))
  session.execute(sel_stmt)
}
resultSets.addAll(resultSet.all())
resultSets.asScala.toList // --> later parallelized into an RDD[Row]
Since we are using CQL almost directly, it does not allow you to do things that Cassandra does not support, like JOINs, because the Cassandra design does not support them. However, the alternate way of using the Spark SQL or Spark DataFrame API to access the Cassandra DB gives you an SQL-type abstraction. For an underlying relational DB this would be good.
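For example, the equivalent read through the DataFrame API would look roughly like this (a minimal sketch, assuming the spark-cassandra-connector's DataFrame source and a SparkSession already configured with the Cassandra connection host; bin, year, month, day, cell and table_name are the same values as above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Read the same table through the connector's DataFrame source
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "geokpi_keyspace", "table" -> table_name))
  .load()

// SQL-style filtering; the connector may push some of these predicates down to Cassandra
val result = df.filter($"bin" === bin && $"year" === year &&
  $"month" === month && $"day" === day && $"cell" === cell)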
But using this abstraction,like JOIN to query the data stored in a NoSQL database like Cassandra seems to be a wrong abstraction.Working with this abstraction in Spark , without knowing anything about the data model (partition key, clustering key etc ), which is so important for efficient Read and Write of data, won't it lead to in-efficient generated code and in-efficient/slow data retrieval from underlying Cassandra node ?
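To make the concern concrete, the DataFrame API happily accepts something like the following (a hypothetical sketch; the second table cell_info is made up for illustration):

// Hypothetical second table, for illustration only
val cellInfoDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "geokpi_keyspace", "table" -> "cell_info"))
  .load()

// No JOIN exists at the CQL level: Spark has to pull rows from both
// tables into its own cluster and join them there, with no awareness
// of partition or clustering keys.
val joined = df.join(cellInfoDf, Seq("cell"))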