When querying Cassandra with a non-indexed column in the WHERE clause, the Spark-Cassandra-Connector's official documentation says:
To filter rows, you can use the filter transformation provided by Spark. However, this approach causes all rows to be fetched from Cassandra and then filtered by Spark.
I am a bit confused about this. Suppose I have a billion rows with the schema ID, City, State, Country, where only ID is indexed. If I use City = 'Chicago' in the WHERE clause, would Spark first download all billion rows and only then filter out the rows where City = 'Chicago'? Or would it read a chunk of data from Cassandra, run the filter, set aside the rows that match, fetch the next chunk, filter that, and so on? And if at any point RAM and/or disk were running low, would it discard the rows that didn't match the criteria before fetching the next chunk and continuing the process?
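To make my mental model concrete, here is a toy Python sketch of the chunked read-filter-discard behavior I am imagining. This is purely illustrative with hypothetical helper names, not the connector's actual code:

```python
def fetch_pages(rows, page_size):
    """Simulate Cassandra's paged reads: yield the table in chunks."""
    for i in range(0, len(rows), page_size):
        yield rows[i:i + page_size]

def filtered(rows, page_size, predicate):
    """Stream each page through the predicate; non-matching rows
    are simply never retained, so they can be garbage-collected."""
    for page in fetch_pages(rows, page_size):
        for row in page:
            if predicate(row):
                yield row

table = [
    {"ID": 1, "City": "Chicago"},
    {"ID": 2, "City": "Boston"},
    {"ID": 3, "City": "Chicago"},
]

# Every row is read over the "network", but only matches are kept.
matches = list(filtered(table, page_size=2,
                        predicate=lambda r: r["City"] == "Chicago"))
print(matches)  # [{'ID': 1, 'City': 'Chicago'}, {'ID': 3, 'City': 'Chicago'}]
```

Is this roughly what happens, i.e. "all rows are fetched" means fetched over the network page by page, not all materialized in memory at once?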
Also, can someone tell me a general formula to calculate how much disk space it would take to store one BigDecimal column and three text columns for a billion rows?
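For context, my own naive back-of-envelope attempt looks like this. All the per-value sizes are assumptions on my part, and it ignores compression, replication factor, and Cassandra's per-cell metadata, which I suspect matter a lot:

```python
def estimate_bytes(n_rows, avg_decimal_bytes, avg_text_bytes,
                   n_text_cols, overhead_per_row):
    """Naive estimate: rows * (decimal + text columns + per-row overhead)."""
    per_row = avg_decimal_bytes + n_text_cols * avg_text_bytes + overhead_per_row
    return n_rows * per_row

total = estimate_bytes(
    n_rows=1_000_000_000,
    avg_decimal_bytes=8,   # assumption: modest-precision decimals
    avg_text_bytes=16,     # assumption: average UTF-8 length per text value
    n_text_cols=3,
    overhead_per_row=20,   # assumption: row/cell metadata, uncompressed
)
print(total / 1024**3)     # rough total in GiB (~70.8 with these numbers)
```

Is a formula of this shape reasonable, or does Cassandra's on-disk format make the overhead term dominate?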