
I have a Spark Streaming application that streams data from Kafka. I rely heavily on the order of the messages, so I have created the Kafka topic with just one partition.

I am deploying this job in cluster mode.

My question is: since I am executing this in cluster mode, more than one executor can pick up tasks. Will I lose the order of messages received from Kafka in that case? If not, how does Spark guarantee order?

fledgling

2 Answers


With a single partition you give up distributed processing power, so I would instead use multiple partitions and attach a sequence number to every message — either a counter or a timestamp.
If your message payload doesn't contain a timestamp, Kafka attaches one to each record, and Spark's Kafka source exposes it; you can use that to order events by timestamp and then process them in sequence.

Refer to this answer on how to extract the timestamp from a Kafka message.
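The reordering idea above can be sketched in plain Python, outside of Spark: each record carries a timestamp (whether from the payload or from Kafka's record metadata), and a downstream step buffers records and replays them in timestamp order. The function and field names here are illustrative, not part of any Kafka or Spark API:

```python
# Sketch: restore ordering for messages that may arrive interleaved
# from multiple partitions, using an attached timestamp as the
# sequence key. Names are illustrative only.

def reorder_by_timestamp(messages):
    """Sort buffered messages by their attached timestamp."""
    return sorted(messages, key=lambda m: m["timestamp"])

# Messages as they might arrive from multiple partitions, out of order:
arrived = [
    {"timestamp": 3, "value": "c"},
    {"timestamp": 1, "value": "a"},
    {"timestamp": 2, "value": "b"},
]

ordered = reorder_by_timestamp(arrived)
print([m["value"] for m in ordered])  # → ['a', 'b', 'c']
```

The trade-off is that you must buffer over some window before emitting, since a late record from one partition can still precede an already-seen record from another.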

Rahul Sharma

Using a single partition is the right choice to maintain order. Here are a few other things you can try:

  1. Turn off speculative execution

spark.speculation - If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.

  2. Adjust your batch interval / sizes so that each batch can finish processing without lag.
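Speculation can be disabled at submit time. As a sketch, the `spark.speculation` flag below is the documented config; the rate-limit setting is one way to keep batches small enough to finish on time (the value and jar name are illustrative):

```shell
# Disable speculative re-launching of slow tasks, so no task is ever
# duplicated, and cap the per-partition ingest rate so each batch can
# finish within its interval.
spark-submit \
  --conf spark.speculation=false \
  --conf spark.streaming.kafka.maxRatePerPartition=1000 \
  your_streaming_app.jar
```

The batch interval itself is set in code when you create the `StreamingContext`, not via `spark-submit`.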

Cheers!

Sachin Thapa