
I'm using Spark Streaming for the following use case:

  1. I have a Kafka topic, data. From this topic I stream data in real time using Spark Structured Streaming and apply some filters to it. If the number of rows returned after applying the filters is greater than 1, the output is 1; otherwise the output is 0, along with some other data from the query.

    In simple terms, suppose I'm filtering the stream using:

    df.filter($"A" < 10) 
    

    where "A", "<" and "10" are dynamic and come from a database. In reality, these values come from another Kafka topic that I consume, and I update them in the database. So the query is not static and will change over time.

  2. Further, I'll have to apply some boolean algebraic operators to the results of the streams. For example:

    df.filter($A < 10) AND df.filter($B = 1) OR df.filter($C > 1)... and so on
    

    Here, each atomic operation (like df.filter($"A" < 10)) returns either 0 or 1, as described above. The final result is saved to Mongo.
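The dynamic part of the filter can be sketched in Spark's Scala API. Assuming each condition is fetched from the database as a (column, operator, value) triple (the `Condition` class and `toColumn` helper below are hypothetical names, not part of Spark), a `Column` predicate can be built at runtime and passed to `filter`:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Hypothetical shape of one condition row fetched from the database.
case class Condition(column: String, op: String, value: String)

// Turn a dynamic condition into a Spark Column predicate.
// Note: the string value may need casting to the column's actual
// type, otherwise the comparison is lexicographic.
def toColumn(c: Condition): Column = c.op match {
  case "<"  => col(c.column) < c.value
  case "<=" => col(c.column) <= c.value
  case ">"  => col(c.column) > c.value
  case ">=" => col(c.column) >= c.value
  case "="  => col(c.column) === c.value
}

// Example: Condition("A", "<", "10") yields a predicate usable as
// df.filter(toColumn(Condition("A", "<", "10")))
```

Because the `Column` is built from plain data, swapping in new values from the database only requires constructing a new predicate; the streaming query itself does not change shape.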

I want to know whether both problems can be solved using Spark Structured Streaming. If not, can they be solved using RDDs?

Otherwise, can someone suggest a way to do this?

Ishan

1 Answer


For the first case you can use a broadcast-variable-based approach, as described in this answer. I've also had good luck with a transient per-executor value that is periodically refetched on each executor, as described in the second part of this answer.
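The periodically refetched per-executor value can be sketched as a lazily refreshed singleton with a TTL. All names here (`FilterConfig`, `fetchFromDb`, the triple type) are hypothetical; `fetchFromDb` stands in for whatever JDBC or Mongo lookup holds the filter values:

```scala
// Lives as a singleton on each executor JVM; every executor
// refetches the conditions independently once the TTL expires.
object FilterConfig {
  private val ttlMs = 60000L // refresh at most once per minute

  @volatile private var lastFetched = 0L
  @volatile private var cached: Seq[(String, String, String)] = Seq.empty

  // Hypothetical database call returning (column, operator, value)
  // triples; replace with your actual lookup.
  private def fetchFromDb(): Seq[(String, String, String)] = ???

  // Call this from executor-side code (e.g. inside mapPartitions).
  def get(): Seq[(String, String, String)] = {
    val now = System.currentTimeMillis()
    if (now - lastFetched > ttlMs) synchronized {
      if (now - lastFetched > ttlMs) {
        cached = fetchFromDb()
        lastFetched = now
      }
    }
    cached
  }
}
```

The double-checked TTL test keeps the common path lock-free; only the thread that observes an expired cache pays for the refetch.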

For the second case you would use a single filter() call that implements the complete set of conditions that causes a message to be included in the output stream.

Joeri Sebrechts
  • Thanks for your response. Can you please explain it a bit more? Suppose I want to broadcast the values of my filter: when would I update the broadcast variable? I want to do it when another Kafka consumer reads new data from another topic. – Ishan Jul 10 '18 at 09:22
  • The broadcast variable approach uses a TTL to periodically fetch data from an external system. You cannot "push" a refresh; you have to wait for it. If the data you want is itself another stream, you are probably better off looking at ways of joining against that stream. – Joeri Sebrechts Jul 10 '18 at 11:55