I have a batch interval of 5 seconds.
I want to look at the number of RDDs formed in one batch, so I added a timestamp inside foreachRDD to print the time in seconds and count the RDDs after 5 seconds.

textStream.foreachRDD(rdd => {
  println("=======" + TimeUnit.MILLISECONDS.toMinutes(Instant.now.toEpochMilli))
  rdd.foreach(println(_))
})

This gives the same time (currently empty input):

=======26461220
=======26461220
=======26461220
=======26461220

The time should change, right? Q1. How do I print the current time?
Q2. How many RDDs are formed in a DStream?

Michael Heil
supernatural

1 Answer

Q1. How do I print the current time?

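You could simply use System.nanoTime(). Print it directly inside foreachRDD so that it runs once per batch on the driver, even when the RDD is empty. Note that System.nanoTime() is a monotonic counter rather than wall-clock time; use System.currentTimeMillis() if you need the latter.

textStream.foreachRDD(rdd => {
  // Printed once per batch, whether or not the RDD has records
  println(System.nanoTime())
  rdd.foreach(println(_))
})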
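As for why the value printed in the question never changes: it converts the timestamp to minutes, so every batch that falls within the same minute prints the same number. Here is a minimal plain-Scala sketch of that effect, with no Spark involved. The object name, the helper names and the starting instant are assumptions chosen for illustration.

```scala
import java.util.concurrent.TimeUnit

object BatchTimestamps {
  // Hypothetical helpers, named here only for illustration.
  def asMinutes(epochMillis: Long): Long = TimeUnit.MILLISECONDS.toMinutes(epochMillis)
  def asSeconds(epochMillis: Long): Long = TimeUnit.MILLISECONDS.toSeconds(epochMillis)

  def main(args: Array[String]): Unit = {
    // An assumed starting instant; any epoch-millisecond value behaves the same way.
    val first = 1587673200000L
    // Four batches, 5 seconds apart, as in the question.
    val batchTimes = (0 until 4).map(i => first + i * 5000L)

    batchTimes.foreach { t =>
      println(s"minutes=${asMinutes(t)} seconds=${asSeconds(t)}")
    }
  }
}
```

All four lines show the same minutes value, while the seconds value advances by 5 each time. Switching toMinutes to toSeconds in the original snippet should therefore be enough to see the time change between batches.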

Q2. How many RDDs are formed in a DStream?

You will get one RDD for each batch interval. The batch interval is set when you create the StreamingContext (for example, Seconds(5)). The stream is called a DStream, which is a sequence of individual RDDs, one per batch interval.
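A minimal sketch of how this might look end to end, assuming Spark Streaming is on the classpath and a text source is listening on a local socket. The app name, master, host and port are placeholders, not values from the question. It uses the two-argument form of foreachRDD, which also passes the batch time.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object BatchTimeSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; app name, master, host and port are placeholders.
    val conf = new SparkConf().setAppName("BatchTimeSketch").setMaster("local[2]")
    // The batch interval is fixed here, when the StreamingContext is created.
    val ssc = new StreamingContext(conf, Seconds(5))
    val textStream = ssc.socketTextStream("localhost", 9999)

    // Called on the driver once per batch interval, with that batch's single RDD.
    textStream.foreachRDD { (rdd: RDD[String], time: Time) =>
      println(s"batch time (ms) = ${time.milliseconds}, records = ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With a 5-second interval, this should print one line roughly every 5 seconds, each with a different batch time, even while the input is empty.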

Michael Heil
  • By one RDD for each batch interval, do you mean one for each second? Say I have a batch interval of 5 seconds, is that supposed to mean 5 RDDs? – supernatural Apr 24 '20 at 04:49
  • In that case you would have one RDD containing the data of the last 5 seconds. – Michael Heil Apr 24 '20 at 04:54
  • Okay, then why is there a dstream.foreachRDD (which loops over every RDD in that batch, right?)? How will it iterate if there is only one RDD? – supernatural Apr 24 '20 at 04:59
  • The DStream generates many RDDs (one for each batch interval) and foreachRDD iterates through them. – Michael Heil Apr 24 '20 at 05:03
  • With a batch interval of 5 seconds, you would get 4 RDDs after 20 seconds. – Michael Heil Apr 24 '20 at 05:06
  • So, no matter what batch interval is given (say 300), one RDD is formed each second, so a 5-minute batch will give 300 RDDs in total? – supernatural Apr 24 '20 at 05:14
  • You mentioned `The dstream generates many RDDs (one for each batch interval) and foreachRDD iterates through them`. Here you mentioned one RDD for each batch interval. Could you please elaborate on `each batch interval`: does it mean 1 second or a batch interval? You mean a second, right? – supernatural Apr 24 '20 at 05:17
  • I really do not know where you got this 1 second from. I never mentioned anything about one second :-) The batch interval is the time that you set when you create your stream. Can you explain where the 1 second comes from? – Michael Heil Apr 24 '20 at 05:35
  • Here is a good answer that explains RDD and foreachRDD (https://stackoverflow.com/questions/36421619/whats-the-meaning-of-dstream-foreachrdd-function) – Michael Heil Apr 24 '20 at 05:58
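
To make the counts discussed in the comments concrete, here is a small plain-Scala sketch of the arithmetic, assuming one RDD per completed batch interval. The object and function names are hypothetical and are not part of Spark.

```scala
object BatchCount {
  // Hypothetical helper: number of completed batches (and therefore RDDs)
  // after `elapsedSeconds`, for a stream created with `batchIntervalSeconds`.
  def rddsAfter(elapsedSeconds: Long, batchIntervalSeconds: Long): Long = {
    require(batchIntervalSeconds > 0, "batch interval must be positive")
    elapsedSeconds / batchIntervalSeconds
  }

  def main(args: Array[String]): Unit = {
    println(rddsAfter(20, 5))    // 5-second interval, 20 seconds elapsed
    println(rddsAfter(300, 300)) // 300-second interval, 5 minutes elapsed
  }
}
```

This prints 4 and then 1: a 5-second interval yields four RDDs in 20 seconds, while a 300-second interval yields a single RDD for the whole 5 minutes, not 300.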