I would like to stream data from a Cassandra table that is updated in real time. Yes, it is a database, but is there a way to do that? If so, how would I keep an offset, or what CQL queries can I use?
3 Answers
Short answer is no.
Long answer is: with a lot of difficulty and smart clustering keys, you can maybe do that. Basically, if you insert data with a clustering key that always increases, you can just scan for clustering keys in a recent time window. This will, of course, miss out-of-order inserts that land outside your window. That may or may not be good enough for your use case.
Best answer in the future is Change Data Capture: https://issues.apache.org/jira/browse/CASSANDRA-8844
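The clustering-key approach above can be sketched in CQL. This is a hypothetical schema (the table and column names are illustrative, not from the question), where an always-increasing `timeuuid` clustering column serves as the "offset":

```sql
-- Illustrative schema: events partitioned by day, clustered by an
-- always-increasing timeuuid so recent rows sort together.
CREATE TABLE sensor_events (
    day        date,
    event_id   timeuuid,
    payload    text,
    PRIMARY KEY ((day), event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);

-- Poll for rows newer than the last point you saw; remember the
-- newest event_id returned and use it as the next "offset".
SELECT event_id, payload
FROM sensor_events
WHERE day = '2016-03-01'
  AND event_id > maxTimeuuid('2016-03-01 21:00:00');
```

Any late write that arrives with an `event_id` older than your saved offset will be skipped on the next poll — that is exactly the out-of-order gap described above.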

Does anyone have an idea of how the spark-cassandra-connector could be used for this? Does it take care of offsets on its own, and does it stream data in near real time? – krish Mar 01 '16 at 21:10
No, and no. It does neither of these things; as I said, you would have to develop some custom code for your own tables (or design custom triggers and receivers). – RussS Mar 01 '16 at 21:30
If you really need a streaming approach, most find it easiest to insert into a queue and then use something like Spark Streaming to analyze the data. – RussS Mar 01 '16 at 21:31
@RussS so the future is already here! That ticket was resolved a few months ago. Do you have any ideas on how to use it with Spark? – Vagif Jan 20 '17 at 00:55
@cholosrus that answer would be for paging a very long request; it would not stream changes from C*, which I believe was the intention. – RussS Jun 08 '20 at 18:13
I understand you were asking specifically about streaming data out of Cassandra, but I would like to suggest that a technology like Apache Kafka sounds like a much better fit for what you're trying to do. It is used by a number of large companies and has excellent real-time performance.
There is a seminal blog post by Jay Kreps called The Log: What every software engineer should know about real-time data's unifying abstraction that does a great job of explaining Kafka's purpose and design. A key quote from the blog post summarizes Kafka's role:
Take all the organization's data and put it into a central log for real-time subscription.

To stream the data from Cassandra, you can use the PageSize option, like so:
iter := cass.Query(`SELECT * FROM cmuser.users;`).PageSize(100).Iter()
The above is an example in Go (using the gocql driver). The description for PageSize is:
PageSize will tell the iterator to fetch the result in pages of size n. This is useful for iterating over large result sets, but setting the page size too low might decrease the performance. This feature is only available in Cassandra 2 and onwards.
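A slightly fuller sketch of that pattern, assuming a running cluster and the gocql driver (the cluster address, keyspace, and column names here are illustrative, not from the question):

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Connect to an illustrative single-node cluster.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "cmuser"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// PageSize(100) fetches 100 rows per round trip; the driver
	// transparently requests the next page as Scan advances.
	iter := session.Query(`SELECT id, name FROM users`).PageSize(100).Iter()

	var id gocql.UUID
	var name string
	for iter.Scan(&id, &name) {
		fmt.Println(id, name)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```

Note that paging only bounds how much data is held in the process per fetch; this still re-reads the whole table on every run, so it does not by itself give you "only the new rows."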
Don't do this if you have significant amounts of data in the tables... – Alex Ott Jun 04 '20 at 09:42
Why? The page size should help prevent too much data from being read into the Go process at once. – Jun 04 '20 at 19:13
Because it needs to go to different nodes, and there is no load balancing, etc., this creates load on the coordinator... a better solution is to scan by token ranges and route the queries explicitly to the replicas. – Alex Ott Jun 04 '20 at 19:17
The answer about `select * from table` really doesn't answer the original question - the author wants only the changes since the last scan. The best answer is from @RussS... If you are interested in token range scans, I have a separate answer about that: https://stackoverflow.com/questions/54104383/performance-of-token-range-based-queries-on-partition-keys/54108700#54108700 – Alex Ott Jun 06 '20 at 08:58
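The token-range scan mentioned in the comments above can be sketched in CQL. This is illustrative only (a hypothetical table `cmuser.users` with partition key `id`; the literal bounds shown split the Murmur3 ring, which spans -2^63 to 2^63-1):

```sql
-- Full scan split into token ranges; each range can be sent
-- directly to the replicas that own it, sparing the coordinator.
SELECT id, name FROM cmuser.users
WHERE token(id) > -9223372036854775808
  AND token(id) <= -4611686018427387904;

-- ...then the next range, and so on until the ring is covered.
SELECT id, name FROM cmuser.users
WHERE token(id) > -4611686018427387904
  AND token(id) <= 0;
```

Like paging, this is still a full-table scan per pass; it parallelizes and localizes the reads rather than tracking changes.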