
What are the best practices for making HTTP calls from a DoFn in a pipeline running on Google Cloud Dataflow? (Java)

In plain Java without Beam, I would need to think about async calls, or at least multithreading, and about managing the thread pool, the connection pool, and so on. With Dataflow, what would happen if I just had one thread make a synchronous call in each ProcessElement? What are the best practices for making HTTP calls in a DoFn?

foxwendy
  • 1
    It depends on the use case. One good practice is to avoid making the call at all: sometimes it's possible to download all the data and supply it as an input to the pipeline instead. When you make HTTP calls in Dataflow you block, and if a request takes a long time, your pipeline can appear to stall. – Alex Amato May 14 '18 at 23:56
  • 2
    Alternatively, consider batching within a ParDo by implementing startBundle, processElement and finishBundle. In startBundle, create a data structure. In processElement, add elements to the structure. In finishBundle, make the HTTP request for all of the elements in one call. Of course, your service needs to support batched calls rather than only single-element calls. If that is possible, consider this approach as well. – Alex Amato May 14 '18 at 23:58
  • 1
    [This](https://stackoverflow.com/a/47560224/9251751) answer to an older question proposes, as a best practice for HTTP calls in a DoFn, putting the client into a member variable, using the `@Setup` method to open it and `@Teardown` to close it. Would this suit your case? – Lefteris S May 15 '18 at 08:38
  • @AlexAmato What about using async calls? Please check this answer: https://stackoverflow.com/questions/49884949/how-to-do-async-http-call-with-apache-beam-java/51154894#51154894 I don't really know whether async makes sense on Dataflow, that is, whether we can treat a Dataflow "thread" as a normal thread, so that ordinary async code behaves the same on Dataflow as in a normal application. – foxwendy Jul 04 '18 at 21:00
  • @LefterisS I tested my DoFn. The setup has only one worker with 32 vCPUs, and the logic in `@Setup` was actually called 16 times. Shouldn't it be called only once? Any clue why? – foxwendy Jul 05 '18 at 18:26
  • `@Setup` is called once per instance of the `DoFn`. Since there are multiple threads, separate instances of the `DoFn` are created so that the state of each `DoFn` instance is independent and each instance can be used in its own thread without needing synchronization. – Lukasz Cwik May 21 '19 at 20:51
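
The lifecycle described in the comments above (one client per `DoFn` instance, one buffer per bundle, one request per batch) can be sketched roughly as follows. This is a plain-Java stand-in rather than a real Beam `DoFn`, so that it compiles without the Beam SDK: the class name `BatchingHttpFn`, the `maxBatchSize` cap, and the `batchSender` hook are all hypothetical, and each method only names in a comment the Beam annotation it would carry. A real `DoFn` must also be serializable, which this sketch does not address.

```java
import java.net.http.HttpClient;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/**
 * Plain-Java stand-in for a Beam DoFn (assumption: it does not extend DoFn,
 * so it compiles without the Beam SDK). Each method names the Beam
 * annotation it would carry in a real pipeline.
 */
class BatchingHttpFn {
    // Hypothetical cap on how many elements go into one request.
    private final int maxBatchSize;
    // Hypothetical hook that sends one batch in a single HTTP request.
    private final BiConsumer<HttpClient, List<String>> batchSender;

    private HttpClient client;   // one per instance, built in setup()
    private List<String> buffer; // one per bundle, built in startBundle()

    BatchingHttpFn(int maxBatchSize, BiConsumer<HttpClient, List<String>> batchSender) {
        this.maxBatchSize = maxBatchSize;
        this.batchSender = batchSender;
    }

    // Would be annotated @Setup: runs once per instance, not once per worker.
    void setup() {
        client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    // Would be annotated @StartBundle: fresh buffer for every bundle.
    void startBundle() {
        buffer = new ArrayList<>();
    }

    // Would be annotated @ProcessElement: buffer, and flush when the cap is hit.
    void processElement(String element) {
        buffer.add(element);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    // Would be annotated @FinishBundle: send whatever is left over.
    void finishBundle() {
        flush();
    }

    // Would be annotated @Teardown. java.net.http.HttpClient has no close()
    // before Java 21, so this sketch only drops the reference.
    void teardown() {
        client = null;
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Hand the sender an immutable copy so clearing the buffer is safe.
        batchSender.accept(client, List.copyOf(buffer));
        buffer.clear();
    }
}
```

Because each instance builds its own client in `setup()`, no synchronization is needed around it, which matches the observation that `@Setup` ran 16 times on a single worker. The size cap is an addition to what the comment describes; it is there only so a large bundle does not turn into one oversized request.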

0 Answers