
In Spark Streaming, using Scala or Java, how can I have a stream that always contains the most recent N elements?

I'm familiar with the possibility of creating windowed streams using methods such as window(windowLength, slideInterval), where both windowLength and slideInterval are time intervals. Is there an equivalent where you specify a row (i.e. element) count for the window length and slide interval instead?

If there is no provided API for such cases, how would you go about implementing it yourself?
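For context, the closest I've gotten is keeping a bounded buffer myself. The core "last N" logic is just a queue truncation; `keepLastN` below is my own sketch, not a Spark API. In Spark Streaming, this update step could in principle be handed to a stateful transform such as `updateStateByKey` on a keyed `DStream`:

```scala
import scala.collection.immutable.Queue

// Sketch: append a batch of new values to the buffer, then keep only
// the most recent n elements. This is the shape of state-update
// function one might pass to updateStateByKey on a keyed DStream;
// keepLastN itself is illustrative, not part of any Spark API.
def keepLastN[T](n: Int, buffer: Queue[T], incoming: Seq[T]): Queue[T] =
  (buffer ++ incoming).takeRight(n)
```

For example, `keepLastN(3, Queue(1, 2), Seq(3, 4, 5))` yields `Queue(3, 4, 5)`. What I don't see is how to get per-element (rather than per-batch) window semantics out of this.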

ig-dev
  • As in the last value for a given field or top N? E.g. for all customers, the total sales to-date you mean? Unclear. – thebluephantom Mar 27 '19 at 09:53
  • @thebluephantom In a continuous stream of live data (such as data from Kafka), the `N` `DataFrames` or `DataSets` that have been emitted most recently. Does that clear it up for you? – ig-dev Mar 27 '19 at 11:00
  • not really, provide an example. just copy you mean? – thebluephantom Mar 27 '19 at 11:08
  • I doubt code makes it clearer if that description didn't do it. Anyway, for example analogous to `DStream::reduceByWindow`, the following: `val intStream: DStream[Int] = createStream(); val sumOfLastValues: DStream[Int] = intStream.reduceByCount(howMany, howOften, _ + _)` – ig-dev Mar 27 '19 at 11:23
  • Examples are better than code imho. – thebluephantom Mar 27 '19 at 12:07
  • That code IS an example. I'm afraid that's all I can do – ig-dev Mar 27 '19 at 12:15
  • I guess we will have to see how others see it. Good luck. – thebluephantom Mar 27 '19 at 12:16
  • @thebluephantom I had another idea as an example: continuously calculate the moving average of a DStream[Int] (or structured stream with Ints) for further processing, creating a new stream of moving-average Ints. If you can solve that efficiently, you are a ninja. I tried for hours – ig-dev Mar 29 '19 at 03:19
  • https://stackoverflow.com/questions/23402303 – thebluephantom Mar 29 '19 at 12:44
  • Thanks, but that only works on a finite `RDD` not on a continuous stream like a `DStream`, as much as I'd like it to – ig-dev Mar 29 '19 at 13:22
  • But it does state Spark Streaming and there are many links. – thebluephantom Mar 29 '19 at 13:29
  • Unless that's the wrong link, the phrase "stream" does not occur once on the linked page, and that one only works for `RDD`s – ig-dev Mar 29 '19 at 13:35
  • rephrase your question as to what you seek with which streaming approach – thebluephantom Mar 30 '19 at 06:51
  • I note no answer(s), so I suggest you re-phrase before I give you the possible solution. – thebluephantom Apr 03 '19 at 07:23
  • The question is clear, and I have provided multiple clarifications and examples for you in the comments. If you have learned something since your last attempted solution, feel free to post an answer – ig-dev Apr 03 '19 at 11:28
  • But what about this comment: "I had another idea as an example: continuously calculate the moving average of a DStream[Int] (or structured stream with Ints) for further processing, creating a new stream of moving-average Ints. If you can solve that efficiently, you are a ninja. I tried for hours – ig-dev Mar 29 at 3:19" – thebluephantom Apr 03 '19 at 15:27
  • What about it? You asked for an example – ig-dev Apr 03 '19 at 18:50
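To make the moving-average example from the comments concrete, here is a self-contained sketch of the per-element step in plain Scala. The name `step` is illustrative, not a Spark API; wiring it into a `DStream` (e.g. via a stateful transform such as `updateStateByKey`) is exactly the part the question asks about:

```scala
import scala.collection.immutable.Queue

// One step of a count-based moving average: append the new value,
// trim the buffer to the last n elements, and emit the current
// average. The buffer is the state that would have to survive
// between elements in a streaming job.
def step(n: Int, buffer: Queue[Int], value: Int): (Queue[Int], Double) = {
  val next = (buffer :+ value).takeRight(n)
  (next, next.sum.toDouble / next.size)
}

// Folding the step over a sample input yields the stream of averages
// over the last 3 values seen so far.
val averages = List(1, 2, 3, 4, 5)
  .foldLeft((Queue.empty[Int], Vector.empty[Double])) {
    case ((buf, out), v) =>
      val (next, avg) = step(3, buf, v)
      (next, out :+ avg)
  }._2
// averages == Vector(1.0, 1.5, 2.0, 3.0, 4.0)
```

The open question is how to apply this element-by-element to an unbounded stream, since Spark Streaming's windowing operates on time intervals, not element counts.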

0 Answers