
I have tried a single-node cluster and a 3-node cluster on my local machine to fetch 2.5 million entries from Cassandra using Spark, but in both scenarios it takes 30 seconds just for a SELECT COUNT(*) from the table. I need this and other similar counts for real-time analytics.

SparkSession.builder().getOrCreate().sql("SELECT COUNT(*) FROM data").show()
Alex Ott

2 Answers


Cassandra isn't designed to iterate over the entire data set in a single expensive query like this. If there are, say, 10 petabytes of data, this query would require reading 10 petabytes off disk, bringing it into memory, and streaming it to the coordinator, which resolves the tombstones/deduplication (you can't just have each replica send a count or you will massively under/over count it) and increments a counter. This is not going to work within a 5-second timeout. You can use aggregation functions over smaller chunks of the data, but not in a single query.

If you really want to make this work, query the system.size_estimates table on each node, and split each token range according to its size so that you get an approximate maximum of, say, 5k rows per read. Then issue a COUNT(*) with a TOKEN restriction for each of the split ranges and combine the values of all those queries. This is how the Spark connector does its full table scans in the SELECT * RDDs, so you just have to replicate that (see the sketch below).
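A minimal sketch of that idea using the DataStax Java driver from Scala is shown below. The keyspace/table names (ks.data) and the partition key column (id) are assumptions you would adapt to your schema; a real implementation would also further split large ranges based on partitions_count, handle the range that wraps around the ring, and query size_estimates on every node, since each node only reports the ranges it owns.

import com.datastax.oss.driver.api.core.CqlSession
import scala.jdk.CollectionConverters._

object TokenRangeCount {
  def main(args: Array[String]): Unit = {
    // Connects to localhost:9042 by default; point it at one of your nodes.
    val session = CqlSession.builder().build()

    // Read the locally owned token ranges and their size estimates.
    val ranges = session.execute(
      "SELECT range_start, range_end FROM system.size_estimates " +
      "WHERE keyspace_name = 'ks' AND table_name = 'data'"
    ).all().asScala

    // Issue one COUNT(*) per token range and sum the results.
    // (A production version would split ranges with large partitions_count
    //  and take care of the wrap-around range.)
    val total = ranges.map { row =>
      val start = row.getString("range_start")
      val end   = row.getString("range_end")
      session.execute(
        s"SELECT COUNT(*) FROM ks.data WHERE token(id) > $start AND token(id) <= $end"
      ).one().getLong("count")
    }.sum

    println(s"Approximate row count: $total")
    session.close()
  }
}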

Easiest, and probably safer and more accurate (but less efficient), is to use Spark to read the entire data set and then count it, without using an aggregation function.
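For example, something like the following with the Spark Cassandra Connector's DataFrame source; keyspace ks and table data are placeholders for your own schema, and the connection host is assumed to be configured on the session already:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "data"))
  .load()

// Spark scans the token ranges in parallel and counts the rows itself,
// instead of asking Cassandra to aggregate everything in one query.
println(df.count())

If you prefer not to ship all the rows to Spark, the connector's RDD API also exposes cassandraCount(), which pushes a per-token-range count down to Cassandra instead.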

Chris Lohfink

How long does it take to run this query directly, without Spark? I think it is not possible to parallelize COUNT queries, so you won't benefit from using Spark for such queries.

simpadjo