Structured Streaming extract most recent values for each id

Question

I have a data stream containing ID, type, and value. For a group of users with a given ID, I receive measurements (values) from different sensors (type). Example of incoming data:

ID type value
1  A    70
2  B    16
1  A    71
2  A    72

I need to create a Spark Structured Streaming app that will perform custom clustering of the obtained data. However, I am stuck at the beginning: I don't know how to create a data set that contains the last measurement for each user and each type. I need this set to cover every user that has ever appeared in the system.

So, basically, for the data stream described above, I need a Structured Streaming app that will give me the set of last measurements for every user and every type:

  ID type value
  1  A    71
  2  B    16
  2  A    72
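
To pin down the semantics being asked for, here is a minimal plain-Python sketch (not Spark code) that folds the example records into a "latest value per (ID, type)" map. It assumes that arrival order defines what "last" means, since the records carry no timestamp; the name `latest_per_key` is made up for illustration.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[int, str, int]  # (ID, type, value)


def latest_per_key(records: Iterable[Record]) -> Dict[Tuple[int, str], int]:
    """Keep the most recently seen value for every (ID, type) pair.

    Assumption: records are iterated in arrival order, so a later record
    for the same key overwrites the earlier one.
    """
    latest: Dict[Tuple[int, str], int] = {}
    for user_id, sensor_type, value in records:
        latest[(user_id, sensor_type)] = value
    return latest


# The example stream from the question, in arrival order.
stream = [(1, "A", 70), (2, "B", 16), (1, "A", 71), (2, "A", 72)]
result = latest_per_key(stream)
for (user_id, sensor_type), value in result.items():
    print(user_id, sensor_type, value)
```

For the example stream this produces exactly the three rows listed above. Keys are never removed, so inactive users keep their last record.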

Users may be inactive for some time, but I still need to keep their records. It would be useful if the output were a DataFrame.

Any ideas for how to do this will be very welcome.

PS I am fairly new to Spark Structured Streaming, sorry if this is a trivial question.

How can you really see what the last measurement is? Surely you need a timestamp of some sort? – thebluephantom, Feb 07 '19 at 14:59
Nonetheless, it will not work. I tried all the tricks. Look at those links, just the two I checked. It is not actually realistic to do. – thebluephantom, Feb 08 '19 at 21:43
@thebluephantom Not exactly. There are workarounds, but nothing clean yet. – Caca, Feb 24 '19 at 19:53

thebluephantom · Answer 1 · 2019-02-08T08:59:02.853

2

The short answer is: this is not possible with Spark Structured Streaming (currently).

There are many posts on this, and none has suggested a solution that actually works.

When you think about it, in reality it is a tall order.

I tried various approaches - even though I knew it was not possible - and always got some sort of error from Spark. These are documented on Stack Overflow at length. E.g.:

Structured streaming custom deduplication

Retain last row for given key in spark structured streaming

edited Feb 08 '19 at 08:59

answered Feb 07 '19 at 21:21

thebluephantom


Would you suggest that I drop Structured Streaming and try to solve this with RDDs? Or is there a smarter solution? :) – Caca Feb 08 '19 at 21:42
You could write to a data store, read it back into a DataFrame, and process it there, which makes it trivial. But that is not a streaming use case. – thebluephantom Feb 08 '19 at 21:45
I would not do this as a streaming use case. – thebluephantom Feb 10 '19 at 12:48
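
The data-store workaround mentioned in the comments could look roughly like the following. This is a sketch only: SQLite stands in for whatever store would actually be used, the table name `latest` and the helpers `upsert_batch` and `read_back` are hypothetical, and the step that hands each batch of incoming rows to `upsert_batch` is left out.

```python
import sqlite3

# In-memory SQLite as a stand-in for an external data store (assumption);
# a real setup would use a persistent database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE latest ("
    " id INTEGER, type TEXT, value INTEGER,"
    " PRIMARY KEY (id, type))"
)


def upsert_batch(rows):
    """Write one batch; a row with an existing (id, type) replaces the old one."""
    conn.executemany(
        "INSERT OR REPLACE INTO latest (id, type, value) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def read_back():
    """Read the current snapshot: one row per (id, type)."""
    return conn.execute(
        "SELECT id, type, value FROM latest ORDER BY id, type"
    ).fetchall()


upsert_batch([(1, "A", 70), (2, "B", 16)])  # first batch
upsert_batch([(1, "A", 71), (2, "A", 72)])  # second batch
snapshot = read_back()
print(snapshot)
```

The primary key on `(id, type)` is what makes the store hold only the latest value per user and sensor type; the snapshot returned by `read_back` could then be loaded into a DataFrame for the clustering step.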
