
In Spark 3, the behavior of the backpressure options on the Kafka and file sources for the Trigger.Once scenario was changed.

But I have a question: how can I configure backpressure for my job when I want to use Trigger.Once?

In Spark 2.4 I have a use case: backfill some data and then start the stream. So I use Trigger.Once, but my backfill scenario can be very large and sometimes puts too heavy a load on my disks (because of shuffles) and on driver memory (because the FileIndex is cached there). So I use maxOffsetsPerTrigger and maxFilesPerTrigger to control how much data Spark processes per batch. That's how I configured backpressure.
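Concretely, a minimal sketch of that Spark 2.4 setup looks like this (the paths, schema, and per-trigger limit here are illustrative placeholders, not my real values):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("backfill").getOrCreate()

// Hypothetical schema of the input files.
val inputSchema = new StructType()
  .add("id", "long")
  .add("ts", "timestamp")

val input = spark.readStream
  .schema(inputSchema)
  .option("maxFilesPerTrigger", 1000) // caps files per batch; honored by Trigger.Once in Spark 2.4
  .parquet("/data/in")                // hypothetical input path

input.writeStream
  .format("parquet")
  .option("path", "/data/out")               // hypothetical output path
  .option("checkpointLocation", "/data/chk") // hypothetical checkpoint location
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()
```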

Now that this ability has been removed, can someone suggest a new way to achieve the same thing?

Grigoriev Nick

1 Answer


Trigger.Once ignores these options right now (in Spark 3), so it will always read everything on the first load.

You can work around that. For example, you can start the stream with a periodic trigger (with some value like 1 hour), skip .awaitTermination, and run a parallel loop that checks whether the first batch is done and then stops the stream. Or you can use continuous mode and terminate the stream once a batch has 0 rows. After that initial load you can switch the stream back to Trigger.Once.
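A rough sketch of the first variant, assuming a Kafka source (the broker, topic, paths, rate limit, and polling interval are all placeholder assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("backfill").getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "events")                    // hypothetical topic
  .option("maxOffsetsPerTrigger", 100000L)          // rate limit; honored by periodic triggers
  .load()

val query = input.writeStream
  .format("parquet")
  .option("path", "/data/out")               // hypothetical output path
  .option("checkpointLocation", "/data/chk") // hypothetical checkpoint location
  .trigger(Trigger.ProcessingTime("1 minute")) // periodic trigger instead of Trigger.Once
  .start()

// Instead of query.awaitTermination(), poll the progress and stop
// once a completed micro-batch has processed zero input rows,
// i.e. the backlog is drained.
var drained = false
while (!drained) {
  Thread.sleep(10000L)
  drained = Option(query.lastProgress).exists(_.numInputRows == 0)
}
query.stop()
// After this initial load, the job can be restarted with Trigger.Once.
```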

Alex Ott
  • It's a very dirty solution that brings multithreading complexity into my code. I could also throw an exception on every second batch and handle stream shutdown on that specific exception. But I would still expect to learn how to do this without hacks. – Grigoriev Nick Mar 24 '21 at 11:04
  • I can't believe that the Spark developers removed the backpressure feature for Trigger.Once without providing any good replacement for it. – Grigoriev Nick Mar 24 '21 at 11:09
  • Are you sure that it worked before? With this solution you don't need any multithreading; just don't use `awaitTermination`. – Alex Ott Mar 24 '21 at 13:03
  • I know that the dev team is working on that... but I don't know in which Spark version it will land. – Alex Ott Mar 24 '21 at 16:37
  • Previously, Trigger.Once just generated a batch, and that batch respected all the configs in the same way as any other trigger. Now I need a dirty hack to stop once the first batch passes. The main issue is that if the batch fails after it has already started, the offset WAL will be present, which is a problem for me because I can't override the config for the next batch. – Grigoriev Nick Mar 24 '21 at 16:38
  • You're correct, it works just fine in Spark 2.4... Until it's fixed properly, that workaround is what I can offer... – Alex Ott Mar 24 '21 at 17:05
  • OK, so Databricks doesn't have any way to control backpressure in the Trigger.Once scenario? And the `fix` is to introduce coupling between the scheduling and backpressure options for stream batches? – Grigoriev Nick Mar 25 '21 at 19:12