
I am building real-time processing for detecting fraudulent ATM card transactions. In order to efficiently detect fraud, the logic requires the last transaction date by card and the sum of transaction amounts by day (or over the last 24 hours).

One use case is: if a card transaction occurs outside the native country more than 30 days after the last transaction in that country, then send an alert as possible fraud.

So I tried to look at Spark Streaming as a solution. In order to achieve this (probably I am missing some idea about functional programming), below is my pseudo code:

stream = ssc.receiverStream()   // input receiver
s1 = stream.mapToPair()         // key: card number, value: transaction date
s2 = s1.reduceByKey()           // keeps the latest transaction date per card within the batch
s2.checkpoint(new Duration(1000));
s2.persist();

I am facing these problems here:

1) How can I use this last transaction date for future comparisons against transactions from the same card?
2) How do I persist the data so that even if the driver program restarts, the old values of s2 are restored?
3) Can updateStateByKey be used to maintain historical state?

I think I am missing a key point of Spark Streaming/functional programming: how to implement this kind of logic.

Jigar Parekh
  • Totally lost on the question here, are you having trouble saving the data to a file? – aaronman Jun 20 '14 at 17:37
  • @aaronman it may not be that simple in a distributed environment with dynamically changing workers ;-) – om-nom-nom Jun 20 '14 at 18:27
  • @om-nom-nom I'm just not clear as to what the problem is; as for saving files in a streaming context, Spark lets you save a file for each DStream you process without too much effort – aaronman Jun 20 '14 at 19:00
  • Yes aaronman, I want to store/update state and use it when the next batch arrives. – Jigar Parekh Jun 20 '14 at 19:04
  • @aaronman if I store it as a file, then how can it be used for the next batch? – Jigar Parekh Jun 20 '14 at 19:15
  • @JigarParekh there are many ways, I still don't fully understand your use case – aaronman Jun 20 '14 at 19:18
  • One use case is: if a card transaction occurs outside the native country more than 30 days after the last transaction in that country, then send an alert as possible fraud – Jigar Parekh Jun 20 '14 at 19:24
  • @om-nom-nom I'm not getting your point. I am assuming that if I can store historical state in an RDD, then I can use that info with future DStreams, but I'm not sure how to do that – Jigar Parekh Jun 21 '14 at 04:08
  • See the Spark Streaming guide sections on [Persistence and RDD Checkpointing](https://spark.apache.org/docs/latest/streaming-programming-guide.html#persistence) for some hints. Maybe also consider [window operations](https://spark.apache.org/docs/latest/streaming-programming-guide.html#operations)? – Karl Higley Jun 21 '14 at 14:08
  • @KarlHigley I have already set the checkpoint interval and also invoked the persist method after each RDD transformation. I doubt window operations can be used, because the transaction date can be a year back too. – Jigar Parekh Jun 21 '14 at 16:18
  • @JigarParekh As it stands, IMHO this is a question too broad and with too few inputs to be solvable in this forum. I think you need to do your homework and come back with more specific questions. Solving this problem has less to do with 'getting functional programming' than it has to do with the volume of data and processing requirements (and both are unclear in the problem statement). – maasg Jun 22 '14 at 10:52
  • Thanks all for the help. I finally figured out a solution by looking at the Scala examples StatefulNetworkWordCount & RecoverableNetworkWordCount; I will post the solution in an answer for others. Now I can see why my question was confusing. – Jigar Parekh Jun 24 '14 at 05:27

1 Answer


If you are using Spark Streaming, you shouldn't really save your state to a file, especially if you are planning to run your application 24/7. If that is not your intention, you will probably be fine with just a plain Spark application, since you would then be facing only a big data (batch) computation and not computation over batches arriving in real time.

Yes, updateStateByKey can be used to maintain state across the various batches, but it has a particular signature that you can see in the docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions
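For example, here is a minimal Scala sketch of how updateStateByKey could keep the latest transaction date per card across batches. The DStream shape, the variable names, and the use of epoch millis for the date are my assumptions, not code from your question; updateStateByKey also needs a checkpoint directory set on the StreamingContext (see the next sketch).

import org.apache.spark.streaming.StreamingContext._   // pair DStream operations (needed on older Spark versions)
import org.apache.spark.streaming.dstream.DStream

// Assumed input: DStream[(String, Long)] of (cardNumber, txnTimeMillis) pairs,
// i.e. the same shape as s1 in the question.
def trackLastTxnDate(cardTxns: DStream[(String, Long)]): DStream[(String, Long)] = {
  // Called once per card per batch: newTimes holds this batch's transaction times
  // for the card (possibly empty), lastSeen is the state carried over from earlier batches.
  val update = (newTimes: Seq[Long], lastSeen: Option[Long]) => {
    (newTimes ++ lastSeen).reduceOption(_ max _)   // Some(latest) keeps the state, None drops the key
  }
  cardTxns.updateStateByKey[Long](update)
}

The resulting DStream carries, in every batch, the latest transaction date seen so far for every card, which you can then join with the current batch to apply rules like the 30-day check.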

Also, persist() is just a form of caching; it doesn't actually persist your data to disk (like to a file).
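On the restart question (problem 2), the usual Spark Streaming pattern is checkpointing to a reliable filesystem plus StreamingContext.getOrCreate, as in the RecoverableNetworkWordCount example mentioned in the comments. A rough sketch, with an illustrative checkpoint path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative path; it must point at a reliable store such as HDFS.
val checkpointDir = "hdfs:///checkpoints/fraud-detection"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FraudDetection")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)   // enables state and metadata checkpointing
  // ... build the DStream graph here: receiver, map to (card, txnTime), updateStateByKey ...
  ssc
}

// First run: createContext() builds a fresh graph.
// After a driver restart: the context, including per-card state, is rebuilt from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()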

I hope this clarifies some of your doubts.

gprivitera
    Is there any way to delete/reset the state of a key when streaming is running 24/7? My application gets killed over a period of time; how should I handle that? – mithra Oct 26 '14 at 05:01