
I have about 4,000 input files (avg ~7 MB each).

My pipeline always fails on the CoGroupByKey step when the data size reaches about 4 GB. If I limit the input to only 300 files, it runs just fine.
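For context, the failing step is a CoGroupByKey over tagged inputs. A minimal sketch of its shape is below; the input names and keys are placeholders, only the `CoGroup Geo data` label comes from the actual job graph:

```python
# Minimal sketch of the failing step (assumption: two keyed inputs;
# "stores"/"regions" and the geo keys are placeholders).
import apache_beam as beam

with beam.Pipeline() as p:
    stores = p | 'Read stores' >> beam.Create([('geo_1', {'store_id': 'a'})])
    regions = p | 'Read regions' >> beam.Create([('geo_1', {'region': 'x'})])

    merged = (
        {'stores': stores, 'regions': regions}
        | 'CoGroup Geo data' >> beam.CoGroupByKey()  # fails once the input grows to ~4 GB
        | 'Print merged' >> beam.Map(print)
    )
```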

When it fails, the logs on GCP Dataflow only show:

Workflow failed. Causes: S24:CoGroup Geo data/GroupByKey/Read+CoGroup Geo data/GroupByKey/GroupByWindow+CoGroup Geo data/Map(_merge_tagged_vals_under_key) failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: 
  store-migration-10212040-aoi4-harness-m7j7
      Root cause: The worker lost contact with the service.,
  store-migration-xxxxx
      Root cause: The worker lost contact with the service.,
  store-migration-xxxxx
      Root cause: The worker lost contact with the service.,
  store-migration-xxxxx
      Root cause: The worker lost contact with the service.

I dug through all the logs in Logs Explorer. Nothing else indicates an error other than the above, not even output from my logging.info calls or try...except blocks.

I think this is related to the memory of the worker instances, but I haven't dug in that direction, because that is the kind of thing I was hoping not to have to worry about when using a managed GCP service.
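For completeness, the worker-related options I can tune when launching the job look roughly like this (a sketch; project, region, and bucket are placeholders, and the shuffle experiment is just one knob I have seen suggested for memory-heavy GroupByKey stages):

```python
# Sketch of Dataflow launch options; all values except the flags themselves
# are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                # placeholder
    '--region=us-central1',                # placeholder
    '--temp_location=gs://my-bucket/tmp',  # placeholder
    '--machine_type=n1-highmem-8',         # larger workers
    '--experiments=shuffle_mode=service',  # Dataflow Shuffle: moves GroupByKey state off worker disks/memory
])
```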

Thanks.

  • That's interesting! Thanks for sharing. `The worker lost contact with the service.` messages are common when the worker is under high memory pressure. Can you share more details about your pipeline, and about the function coming after the CoGBK? – Pablo Oct 22 '20 at 18:40
  • Agree with Pablo, it looks like a memory issue. Do you have hot keys? Have you tried machines with more memory? – Iñigo Oct 23 '20 at 23:14
  • @Pablo I tried `n1-highmem-4` and `-8` and it still crashed. The GroupByKey within that step shows ~15 GB of data, which is less than the memory of `-8`, yet it still crashes there. – khiem.nix Nov 02 '20 at 04:55

0 Answers