
I have a Spark Streaming application that streams data from Kafka. I rely heavily on the order of the messages, so I have created the Kafka topic with just one partition.

I am deploying this job in cluster mode.

My question is: since I am executing this in cluster mode, more than one executor can pick up tasks. Will I lose the order of messages received from Kafka in that case? If not, how does Spark guarantee order?

fledgling

2 Answers


With a single partition you give up distributed processing power, so I would instead use multiple partitions and attach a sequence number to every message — either a counter or a timestamp.
If your message payload doesn't contain a timestamp, Kafka attaches one to each record, and Spark's Kafka source exposes it; you can use that to order events by timestamp and then process them in sequence.

Refer to this answer on how to extract the timestamp from a Kafka message.
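The reordering idea above can be sketched in plain Python, outside of Spark: each record carries a timestamp (whether from the payload or from Kafka's record metadata), and a downstream step buffers records and replays them in timestamp order. The function and field names here are illustrative, not part of any Kafka or Spark API:

```python
# Sketch: restore ordering for messages that may arrive interleaved
# from multiple partitions, using an attached timestamp as the
# sequence key. Names are illustrative only.

def reorder_by_timestamp(messages):
    """Sort buffered messages by their attached timestamp."""
    return sorted(messages, key=lambda m: m["timestamp"])

# Messages as they might arrive from multiple partitions, out of order:
arrived = [
    {"timestamp": 3, "value": "c"},
    {"timestamp": 1, "value": "a"},
    {"timestamp": 2, "value": "b"},
]

ordered = reorder_by_timestamp(arrived)
print([m["value"] for m in ordered])  # → ['a', 'b', 'c']
```

The trade-off is that you must buffer over some window before emitting, since a late record from one partition can still precede an already-seen record from another.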

Rahul Sharma

Using a single partition is the right choice to maintain order. Here are a few other things you can try:

  1. Turn off speculative execution

spark.speculation - If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.

  2. Adjust your batch interval / sizes so that each batch can finish processing without lag.
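Speculation can be disabled at submit time. As a sketch, the `spark.speculation` flag below is the documented config; the rate-limit setting is one way to keep batches small enough to finish on time (the value and jar name are illustrative):

```shell
# Disable speculative re-launching of slow tasks, so no task is ever
# duplicated, and cap the per-partition ingest rate so each batch can
# finish within its interval.
spark-submit \
  --conf spark.speculation=false \
  --conf spark.streaming.kafka.maxRatePerPartition=1000 \
  your_streaming_app.jar
```

The batch interval itself is set in code when you create the `StreamingContext`, not via `spark-submit`.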

Cheers!

Sachin Thapa