I have gone through this Stack Overflow question; according to the answer, Spark Streaming creates a DStream with only one RDD per batch interval.
For example:
My batch interval is 1 minute and my Spark Streaming job consumes data from a Kafka topic.
My question is: does the RDD available in the DStream pull/contain the entire data for the last one minute? Are there any criteria or options we need to set to pull all of the data created in the last one minute?
If I have a Kafka topic with 3 partitions, and all 3 partitions contain data for the last one minute, will the DStream pull/contain all the data created in the last one minute across all of the Kafka topic partitions?
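To make the setup concrete, here is a minimal sketch of the kind of job I mean, using the spark-streaming-kafka-0-10 direct stream API. The broker address, group id, and topic name are placeholders, and the topic is assumed to have 3 partitions:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("KafkaDStreamExample")
val ssc = new StreamingContext(conf, Minutes(1)) // 1-minute batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",           // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",             // placeholder group id
  "auto.offset.reset"  -> "latest"
)

val topics = Array("my-topic") // placeholder topic, assumed to have 3 partitions

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

// One RDD is handed to this block per 1-minute batch; I am trying to
// understand whether that single RDD holds everything produced to all
// 3 partitions of the topic during that minute.
stream.foreachRDD { rdd =>
  println(s"records in this batch: ${rdd.count()}, RDD partitions: ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTermination()
```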
Update:
In which case does a DStream contain more than one RDD?