I had a job running that downloads some files; it usually takes about 10 minutes. This one ran for more than an hour before it finally failed with the following as its only error message:

Workflow failed. Causes: (3f03d0279dd2eb98): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.

So here I am :-) The jobId: 2017-08-29_13_30_03-3908175820634599728

Just out of curiosity, will we be billed for the hour of stuckness? And what was the problem?

I'm working with Dataflow version 1.9.0.

Thanks Google Dataflow Team

Malte
  • Something is definitely very odd with that job. I've marked it for internal investigation and we'll get back to you when we figure out what went wrong. Sorry about that! Is this a consistent failure? Or is the job running fine now? – Lara Schmidt Aug 30 '17 at 17:35
  • Because the download pipeline is created inside a DoFn, it was retried automatically and finished after ~6.5 min. JobId: 2017-08-29_15_29_11-1856842692501462974 – Malte Sep 01 '17 at 13:08

1 Answer

It looks as though all of the job's workers were spending almost 100% of their time in Java garbage collection: Full GCs of about 7 seconds each were occurring roughly every 7 seconds.
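To make that symptom visible from inside the JVM, here is a minimal sketch that estimates the fraction of wall-clock time spent in garbage collection using the standard `java.lang.management` beans. The class name `GcOverhead` and its method names are made up for this illustration and are not part of the Dataflow SDK; how and where you would run this on a worker is left open.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the Dataflow SDK.
final class GcOverhead {

    /** Sums the accumulated collection time (ms) reported by all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Fraction of the elapsed interval spent in GC, clamped to [0, 1]. */
    static double gcFraction(long gcMillisBefore, long gcMillisAfter, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        double fraction = (double) (gcMillisAfter - gcMillisBefore) / elapsedMillis;
        return Math.max(0.0, Math.min(1.0, fraction));
    }

    public static void main(String[] args) throws InterruptedException {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        Thread.sleep(1000); // sampling window; a long-running process would loop here
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        double fraction = gcFraction(gcBefore, totalGcMillis(), elapsedMillis);
        System.out.printf("GC overhead over the last %d ms: %.1f%%%n",
                elapsedMillis, fraction * 100);
    }
}
```

A value that stays close to 100% across several sampling windows would match the pattern described above, where a Full GC of about 7 seconds starts roughly every 7 seconds.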

Your best next step is to get a heap dump of the job by logging into one of the worker machines and running jmap. Then use a heap dump analysis tool to inspect where the memory is allocated. It is best to compare the heap dump of a properly functioning job against that of a broken job. If you would like further help from Google, feel free to contact Google Cloud Support and share this SO question along with the heap dumps. This would be especially useful if you suspect the issue lies within Google Cloud Dataflow itself.
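The answer's suggestion is jmap, typically run on the machine as `jmap -dump:format=b,file=heap.hprof <pid>`. As an alternative sketch, the same kind of `.hprof` file can be written from inside the JVM through `HotSpotDiagnosticMXBean`. This assumes a HotSpot/OpenJDK runtime on Java 8 or later; the class name `HeapDumper`, the output directory and the file prefix are hypothetical, and the sketch does not cover copying the file off the worker before it is torn down.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the Dataflow SDK. Assumes a HotSpot-based JVM.
final class HeapDumper {

    /**
     * Writes a heap dump of the current JVM into the given directory and
     * returns the created file. The timestamp keeps names unique, because
     * dumpHeap refuses to overwrite an existing file.
     */
    static File dump(File directory, String prefix) {
        File out = new File(directory, prefix + "-" + System.currentTimeMillis() + ".hprof");
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            // live = true dumps only reachable objects (a GC runs first).
            bean.dumpHeap(out.getAbsolutePath(), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        File dumpFile = dump(tmp, "worker-heap"); // "worker-heap" is an arbitrary prefix
        System.out.println("Heap dump written to " + dumpFile.getAbsolutePath());
    }
}
```

The resulting file can be opened in a heap dump analysis tool in the same way as one produced by jmap, which makes it possible to compare a healthy run against a stuck one.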

Lukasz Cwik
  • How am I supposed to do this now? The machines exist only while the job runs, and the job usually doesn't have any problems. Debugging this would be disastrous. See my comment under the question: the second run finished after 6.5 min with no problems – Malte Sep 01 '17 at 13:11
  • Unfortunately, without a heap profile we cannot do any more than you can (we do not have access to your workers, since they are your machines). If it happens again, you should be able to grab the profile, and we can investigate from there. – Lara Schmidt Sep 01 '17 at 20:33