I'm analyzing the backpressure feature in Spark Structured Streaming. Does anyone know the details? Is it possible to tune how incoming records are processed from code? Thanks

- How would you define backpressure? – Jacek Laskowski Jul 03 '17 at 04:50
- I mean the feature that dynamically manages the ingestion rate of records. In Spark Streaming it can be enabled, and if you use Kafka you can also tune spark.streaming.kafka.maxRatePerPartition. What about Structured Streaming? How does it work internally? Can the programmer control it? – Aniello Guarino Jul 04 '17 at 15:48
2 Answers
If you mean dynamically changing the size of each internal batch in Structured Streaming, then no. There are no receiver-based sources in Structured Streaming, so that is not necessary. From another point of view, Structured Streaming cannot do real backpressure either: for example, Spark cannot tell other applications to slow down the rate at which they push data into Kafka.
In general, Structured Streaming tries to process data as fast as possible by default. Each source has options to limit the processing rate, such as maxFilesPerTrigger in the File source and maxOffsetsPerTrigger in the Kafka source. See the following links for more details:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
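As a rough illustration of the per-source rate options mentioned above, here is a minimal PySpark-style sketch. The helper names, the broker address, and the topic name are hypothetical; only the option keys (maxOffsetsPerTrigger, maxFilesPerTrigger, kafka.bootstrap.servers, subscribe) come from the Spark documentation. Actually starting the Kafka stream also assumes an existing SparkSession with the Kafka connector package on the classpath.

```python
def kafka_rate_limited_options(bootstrap_servers, topic, max_offsets_per_trigger):
    """Build Kafka source options that cap the records read per micro-batch.

    Hypothetical helper: it only assembles a dict of source options.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Upper bound on the total number of offsets consumed per trigger.
        "maxOffsetsPerTrigger": str(max_offsets_per_trigger),
    }


def file_rate_limited_options(max_files_per_trigger):
    """Build File source options that cap the new files read per micro-batch."""
    return {"maxFilesPerTrigger": str(max_files_per_trigger)}


def read_kafka_stream(spark, options):
    """Create the streaming DataFrame; `spark` is an existing SparkSession."""
    return spark.readStream.format("kafka").options(**options).load()


# Example values below are assumptions, not taken from the thread.
opts = kafka_rate_limited_options("localhost:9092", "events", 10000)
print(opts)
```

With a running session, `read_kafka_stream(spark, opts)` would return a streaming DataFrame whose micro-batches each read at most about 10000 offsets in total. The cap is static: Spark does not adjust it based on how long earlier batches took.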

Back-pressure handling is needed only in push-based mechanisms. Kafka consumers are pull-based: Spark pulls the next batch of records only when the current batch has finished processing and saving. If processing or saving is delayed in Spark, it does not pull a new batch of records, so there is no need for back-pressure handling.
maxOffsetsPerTrigger caps the number of records processed per Spark micro-batch, and spark.streaming.backpressure.enabled (in non-structured Spark Streaming) adjusts the receiving rate, but neither is the same as back pressure in the sense of telling the source to slow down.
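To make the pull-based argument concrete, here is a small plain-Python simulation (no Spark involved). It is only a sketch under simplifying assumptions: one partition modelled as a queue, a fixed arrival rate, and one micro-batch per tick. The function name and parameters are hypothetical.

```python
from collections import deque


def simulate_pull(arrivals_per_tick, max_offsets_per_trigger, ticks):
    """Simulate a pull-based consumer with a per-trigger cap.

    Returns (records processed per batch, backlog left in the log per tick).
    """
    log = deque()  # stands in for a Kafka partition
    processed, backlog = [], []
    next_id = 0
    for _tick in range(ticks):
        # Producers keep appending, regardless of how the consumer is doing.
        for _ in range(arrivals_per_tick):
            log.append(next_id)
            next_id += 1
        # The consumer pulls one batch, capped per trigger, and only once
        # the previous batch is done (one batch per loop iteration here).
        batch_size = min(len(log), max_offsets_per_trigger)
        batch = [log.popleft() for _ in range(batch_size)]
        processed.append(len(batch))
        backlog.append(len(log))
    return processed, backlog


processed, backlog = simulate_pull(arrivals_per_tick=5, max_offsets_per_trigger=3, ticks=4)
print(processed)  # [3, 3, 3, 3]
print(backlog)    # [2, 4, 6, 8]
```

The consumer is never overloaded because it only takes what it asks for, while the unread backlog grows in the log itself. Nothing in the loop tells the producers to slow down, which is the distinction the answer draws.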

- Then how come Spark Streaming (non-structured) has a back-pressure property? Do you mean to say it is push-based? https://spark.apache.org/docs/latest/streaming-programming-guide.html#requirements – AbhishekN Jul 29 '19 at 18:22
- Kafka supports only pull-based consumption: https://stackoverflow.com/questions/39586635/why-is-kafka-pull-based-instead-of-push-based . So it is always pull, irrespective of structured or non-structured streaming. – spats Jan 30 '20 at 23:12