
I have tried a single-node cluster and a 3-node cluster on my local machine to fetch 2.5 million entries from Cassandra using Spark, but in both scenarios it takes 30 seconds just for a SELECT COUNT(*) from the table. I need this and other similar counts for real-time analytics.

SparkSession.builder().getOrCreate().sql("SELECT COUNT(*) FROM data").show()
Alex Ott

2 Answers


Cassandra isn't designed to iterate over the entire data set in a single expensive query like this. If there are, say, 10 petabytes of data, this query would require reading 10 petabytes off disk, bringing it into memory, and streaming it to the coordinator, which resolves the tombstones/deduplication (you can't just have each replica send a count or you will massively under/over count it) and increments a counter. This is not going to work within a 5-second timeout. You can use aggregation functions over smaller chunks of the data, but not in a single query.

If you really want to make this work, query the system.size_estimates table on each node, and split each token range according to its size so that you get an approximate maximum of, say, 5k rows per read. Then issue a COUNT(*) with a TOKEN restriction for each of the split ranges and combine the values of all those queries. This is how the Spark connector does its full table scans in the SELECT * RDDs, so you just have to replicate that (see the sketch below).
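A minimal sketch of that idea using the DataStax Java driver from Scala is shown below. The keyspace/table names (ks.data) and the partition key column (id) are assumptions you would adapt to your schema; a real implementation would also further split large ranges based on partitions_count, handle the range that wraps around the ring, and query size_estimates on every node, since each node only reports the ranges it owns.

import com.datastax.oss.driver.api.core.CqlSession
import scala.jdk.CollectionConverters._

object TokenRangeCount {
  def main(args: Array[String]): Unit = {
    // Connects to localhost:9042 by default; point it at one of your nodes.
    val session = CqlSession.builder().build()

    // Read the locally owned token ranges and their size estimates.
    val ranges = session.execute(
      "SELECT range_start, range_end FROM system.size_estimates " +
      "WHERE keyspace_name = 'ks' AND table_name = 'data'"
    ).all().asScala

    // Issue one COUNT(*) per token range and sum the results.
    // (A production version would split ranges with large partitions_count
    //  and take care of the wrap-around range.)
    val total = ranges.map { row =>
      val start = row.getString("range_start")
      val end   = row.getString("range_end")
      session.execute(
        s"SELECT COUNT(*) FROM ks.data WHERE token(id) > $start AND token(id) <= $end"
      ).one().getLong("count")
    }.sum

    println(s"Approximate row count: $total")
    session.close()
  }
}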

Easiest, and probably safer and more accurate (but less efficient), is to use Spark to read the entire data set and then count it, without using an aggregation function.
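For example, something like the following with the Spark Cassandra Connector's DataFrame source; keyspace ks and table data are placeholders for your own schema, and the connection host is assumed to be configured on the session already:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "data"))
  .load()

// Spark scans the token ranges in parallel and counts the rows itself,
// instead of asking Cassandra to aggregate everything in one query.
println(df.count())

If you prefer not to ship all the rows to Spark, the connector's RDD API also exposes cassandraCount(), which pushes a per-token-range count down to Cassandra instead.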

Chris Lohfink

How long does it take to run this query directly, without Spark? I think it is not possible to parallelize COUNT queries, so you won't benefit from using Spark for such queries.

simpadjo