
We are developing a pipeline in Apache Flink (DataStream API) that needs to send its messages to an external system via API calls. Sometimes such an API call fails, in which case the message needs some extra treatment (and/or a retry).

We had a few options for doing this:

  • We map() our stream through a function that performs the API call and returns its result, so we can act on failures afterwards (this was my original idea, and why I did it: flink scala map with dead letter queue).

  • We write a custom sink function that does the same.

However, both options have problems, I think:

  • With the map() approach I won't get exactly-once (or at-most-once, which would also be fine) semantics, since Flink is free to re-execute parts of the pipeline after recovering from a crash in order to bring the state up to date.

  • With the custom sink approach I can't get a stream of failed API calls for further processing: a sink is a dead end from the Flink app's point of view.

Is there a better solution for this problem?

1 Answer


The async I/O operator is designed for this scenario. It's a better starting point than a map.

There's also been recent work done to develop a generic async sink, see FLIP-171. This has been merged into master and will be released as part of Flink 1.15.

One of those should be your best way forward. Whatever you do, don't do blocking I/O in your user functions. That causes backpressure and often leads to performance problems and checkpoint failures.
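To make the async I/O approach concrete, here is a minimal sketch of what it could look like in the Scala DataStream API. The names `ApiCallFunction`, `callApi`, `ApiSuccess`, and `ApiFailure` are hypothetical; the key point is that the operator wraps each outcome (success or failure) in a result type, so failed calls stay in the stream instead of dying in a sink:

```scala
import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.async.{ResultFuture, RichAsyncFunction}
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

// Wrap each outcome so failures remain ordinary stream elements.
sealed trait ApiResult
case class ApiSuccess(msg: String) extends ApiResult
case class ApiFailure(msg: String, error: String) extends ApiResult

class ApiCallFunction extends RichAsyncFunction[String, ApiResult] {
  implicit lazy val ec: ExecutionContext = ExecutionContext.global

  // Placeholder for a real non-blocking HTTP client call.
  private def callApi(msg: String): Future[Unit] = Future { /* async HTTP call */ }

  override def asyncInvoke(input: String, resultFuture: ResultFuture[ApiResult]): Unit =
    callApi(input).onComplete {
      case Success(_) => resultFuture.complete(Iterable(ApiSuccess(input)))
      case Failure(e) => resultFuture.complete(Iterable(ApiFailure(input, e.getMessage)))
    }
}

// Usage sketch: the result is still a DataStream, so failures can be
// filtered out and routed to a retry branch or a dead letter queue.
// val results  = AsyncDataStream.unorderedWait(stream, new ApiCallFunction, 5, TimeUnit.SECONDS, 100)
// val failures = results.filter(_.isInstanceOf[ApiFailure])
```

This addresses the "sink is a dead end" problem directly: the API call becomes an operator in the middle of the pipeline, and the failure branch is just another stream.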

David Anderson
    Hello, thanks! I didn't think about making it async, but I will. But then I still have my problem of exactly-once semantics when using an operator, and of not being able to treat failed calls when using a custom sink? – Frank Dekervel Jan 20 '22 at 10:47
    @FrankDekervel Exactly-once semantics are, in general, impossible to achieve in REST communication between two services. Instead, you can add some kind of idempotencyId to your request and then have an idempotent consumer. Another approach would be to switch to communication via e.g. Kafka and use the transactional API https://www.confluent.io/blog/transactions-apache-kafka/ – dswiecki Feb 17 '22 at 14:49
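The idempotency-key idea from the last comment can be sketched in a few lines of plain Scala. This is an illustrative helper (the name `IdempotencyKey` is made up): deriving the key deterministically from the payload means that a retried delivery after a Flink restart carries the same key, so an idempotent consumer can recognize and drop the duplicate:

```scala
import java.security.MessageDigest

// Hypothetical helper: derive a deterministic idempotency key from the
// message payload, e.g. to send as an Idempotency-Key HTTP header.
object IdempotencyKey {
  def forPayload(payload: String): String = {
    val digest = MessageDigest.getInstance("SHA-256").digest(payload.getBytes("UTF-8"))
    digest.map("%02x".format(_)).mkString // 64-char hex string
  }
}

// The same message always yields the same key, so re-sending it after a
// crash/recovery is safe: the consumer deduplicates on the key.
```

This turns the delivery guarantee from "exactly once" into the achievable "at least once, with effectively-once effects on the receiver".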