Datastore with huge number of read and write and integration performance with Spark Structured Streaming

Question

I have a use case where around 150 million records are stored in a NoSQL datastore. Each day there may be a batch of new inserts and updates, on the order of 10K and 20-25 million respectively. These updates are processed with Spark Structured Streaming. I used HBase as an initial solution, but I'm not sure whether it's the best choice. The business logic involves a join, for which Spark has to read all 150 million records, though only twice a day. On the other hand, around 25-30K records/sec stream in continuously and have to be updated in the datastore after the join. I went through this article. Which datastore would be the best choice, considering performance and integration with Spark Structured Streaming?

score 0 · Answer 1 · answered Jul 27 '19 at 10:42

HBase is a KV store and is in fact suitable for this.

But if I understand your approach, you seem to want to do a JOIN against the full dataset. This is not the right approach: there is too much data, and therefore too much time elapsed per micro-batch, even with caching. JOINing only works with small reference tables (from Hive or Kudu).

You need something akin to this:

val query = ds.writeStream
              .foreach(new HBaseForeachWriter ...

See Spark Structured Streaming with Hbase integration for guidance and you should be on your way.
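As a rough illustration of the foreach sink mentioned above, here is a minimal sketch of what such a writer might look like. It is not the code from the linked post: the class body, the table name, the column family, and the "id" and "value" columns of the streaming rows are all assumptions made for the example. It relies only on Spark's ForeachWriter contract (open, process, close) and the standard HBase client API.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical writer: the table name, column family, and the "id"/"value"
// columns of the incoming rows are assumptions for illustration only.
class HBaseForeachWriter(tableName: String, columnFamily: String)
    extends ForeachWriter[Row] {

  // Created in open() on the executor, so they are never serialized
  // from the driver.
  @transient private var connection: Connection = _
  @transient private var table: Table = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    // HBaseConfiguration.create() picks up hbase-site.xml from the classpath.
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    table = connection.getTable(TableName.valueOf(tableName))
    true // process every row of this partition
  }

  override def process(row: Row): Unit = {
    // The row key is taken from the assumed "id" column.
    val put = new Put(Bytes.toBytes(row.getAs[String]("id")))
    put.addColumn(
      Bytes.toBytes(columnFamily),
      Bytes.toBytes("value"),
      Bytes.toBytes(row.getAs[String]("value")))
    table.put(put) // one Put per record: simple, but not the fastest option
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (table != null) table.close()
    if (connection != null) connection.close()
  }
}

// Usage, assuming ds is a streaming DataFrame with "id" and "value" columns:
//   val query = ds.writeStream
//     .foreach(new HBaseForeachWriter("my_table", "cf"))
//     .start()
```

Issuing one Put per record keeps the sketch short. At 25-30K records/sec it would probably be worth collecting the Puts of a partition and sending them in batches, for example with the Table.put overload that takes a list, but that is a tuning decision to validate against your own cluster.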
