
I'm using the default Dataflow template Cloud Storage Text to Pub/Sub. The input files in Cloud Storage are about 300 MB each and contain 2-3 million rows.

When launching the Dataflow batch job, the following error is raised:

Error message from worker: javax.naming.SizeLimitExceededException: Pub/Sub message size (1089680070) exceeded maximum batch size (7500000) org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Write$PubsubBoundedWriter.processElement(PubsubIO.java:1160)

From the documentation: Pub/Sub accepts a maximum of 1,000 messages in a batch, and the size of a batch cannot exceed 10 megabytes.

Does this mean that I have to split the input files into 10 MB chunks or 1,000-message batches before publishing?

What is the recommended way to load such large files (300 MB each) into Pub/Sub?

Thanks in advance for your help.


1 Answer


This is a known limitation on the Dataflow side; at this moment there is a feature request to increase the batch size. Use the +1 button and star the issue to follow its progress.

I recommend you check this post, where a workaround is suggested. Keep in mind that this workaround requires modifying the Cloud Storage Text to Pub/Sub template to implement the custom transform mentioned there.
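I can't reproduce the linked post here, but the general idea of such a custom transform is to break any oversized record into chunks that fit Pub/Sub's limit before the elements reach the Pub/Sub write step. A minimal sketch of that idea, written with the Beam Python SDK for brevity (the template itself is Java); the class name and the 7 MB threshold are placeholders I chose, not values from the template:

```python
import apache_beam as beam

# Placeholder threshold: stay well under Pub/Sub's 10 MB per-message limit.
MAX_PUBSUB_BYTES = 7_000_000


class SplitOversizedElements(beam.DoFn):
    """Re-emit each element as one or more byte chunks no larger than
    MAX_PUBSUB_BYTES, so every published message fits Pub/Sub's limit."""

    def process(self, element):
        data = element.encode("utf-8") if isinstance(element, str) else element
        for start in range(0, len(data), MAX_PUBSUB_BYTES):
            yield data[start:start + MAX_PUBSUB_BYTES]
```

A ParDo like this would sit between the text read and the Pub/Sub write. Note that if a single record is split into several messages, downstream consumers will need to reassemble the chunks.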

On the other hand, you can try creating a Cloud Function to split your files before they are processed by Dataflow. I'm thinking of something like this (a sketch of the Cloud Function follows the list):

  1. Create a "staging" bucket to upload your large files.
  2. Write a Cloud Function that splits your files and writes the small chunks to another bucket. You can try the filesplit Python package for that.
  3. Trigger the Cloud Function to run every time you upload a new file to the "staging" bucket, using Google Cloud Storage Triggers.
  4. Once the file has been split into small chunks, delete the large file from the "staging" bucket with the same Cloud Function to avoid extra charges.
  5. Use the Dataflow template Cloud Storage Text to Pub/Sub to process the small chunks from the second bucket.
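As a rough sketch of steps 2-4, assuming a 1st gen Python Cloud Function with a google.storage.object.finalize trigger, the filesplit package and the google-cloud-storage client; the bucket name, chunk size, and function name below are placeholders to adapt:

```python
import os
import tempfile

from filesplit.split import Split   # pip install filesplit (3.x+ API assumed)
from google.cloud import storage    # pip install google-cloud-storage

OUTPUT_BUCKET = "my-chunks-bucket"   # placeholder: the bucket Dataflow reads from
LINES_PER_CHUNK = 100_000            # placeholder: tune so each chunk stays small


def split_uploaded_file(event, context):
    """Background function triggered by google.storage.object.finalize
    on the "staging" bucket (steps 2-4 of the list above)."""
    client = storage.Client()
    src_bucket = client.bucket(event["bucket"])
    src_blob = src_bucket.blob(event["name"])

    workdir = tempfile.mkdtemp()
    local_path = os.path.join(workdir, os.path.basename(event["name"]))
    chunk_dir = os.path.join(workdir, "chunks")
    os.makedirs(chunk_dir, exist_ok=True)

    # Download the large file from the staging bucket.
    src_blob.download_to_filename(local_path)

    # Split it into smaller files of LINES_PER_CHUNK lines each.
    Split(local_path, chunk_dir).bylinecount(LINES_PER_CHUNK)

    # Upload every chunk (filesplit also writes a manifest file, so only
    # pick up files that keep the original extension).
    ext = os.path.splitext(event["name"])[1]
    dst_bucket = client.bucket(OUTPUT_BUCKET)
    for name in os.listdir(chunk_dir):
        if name.endswith(ext):
            dst_bucket.blob(name).upload_from_filename(os.path.join(chunk_dir, name))

    # Remove the original large file from the staging bucket to avoid extra charges.
    src_blob.delete()
```

Keep in mind that the function downloads the file to local temporary storage, which on 1st gen Cloud Functions counts against the instance's memory, so it needs to be sized to hold a 300 MB file plus its chunks.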