We have a large number of Avro files on GCP (total storage is in the petabytes) written with older schemas, in which the header schema section contains "default":"null" for a few 'record' type columns. When we try to load these files into BigQuery, it cannot interpret that default. The fix appears to be changing "default":"null" to "default":null.
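To make the intended change concrete, here is a minimal sketch of the schema edit on the parsed schema JSON. The function name `fix_null_defaults` and the sample schema are hypothetical. It assumes the affected fields are unions whose first branch is "null" (or plain "null" fields), so a genuine string default of "null" on a string field is left untouched.

```python
import json


def fix_null_defaults(node):
    """Return a copy of a parsed Avro schema with the string default "null"
    replaced by JSON null on nullable fields (hypothetical helper)."""
    if isinstance(node, list):
        return [fix_null_defaults(item) for item in node]
    if not isinstance(node, dict):
        return node
    # Recurse into everything except default values themselves.
    fixed = {
        key: (value if key == "default" else fix_null_defaults(value))
        for key, value in node.items()
    }
    field_type = fixed.get("type")
    # Assumption: only fields of type "null" or unions starting with "null"
    # should have their default rewritten.
    nullable = field_type == "null" or (
        isinstance(field_type, list) and field_type[:1] == ["null"]
    )
    if nullable and fixed.get("default") == "null":
        fixed["default"] = None
    return fixed


# Made-up schema resembling the problem described above.
old_schema = json.loads("""
{"type": "record", "name": "Row", "fields": [
  {"name": "details",
   "type": ["null", {"type": "record", "name": "Details", "fields": []}],
   "default": "null"},
  {"name": "label", "type": "string", "default": "null"}
]}
""")
new_schema = fix_null_defaults(old_schema)
print(json.dumps(new_schema["fields"][0]["default"]))
```

The input is not mutated, so the old and new schemas can be compared before anything is written back.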

We have written a couple of custom Python scripts (using the avro and fastavro libraries) to convert the header to the newer format, but they take a long time: about 25 minutes to process even a single 1 GB file.
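For comparison, a header-only rewrite might look like the sketch below. This is not our avro/fastavro code. It is a different technique that parses only the object container file header (magic bytes, metadata map, 16-byte sync marker), replaces the `avro.schema` entry, and copies the data blocks through byte for byte without decoding any records. It assumes that the binary encoding of the records does not depend on field defaults, so the blocks stay valid under the corrected schema. All function names are hypothetical, and `fix_schema` stands for any callable that takes the parsed schema and returns the corrected one.

```python
import io
import json
import shutil

MAGIC = b"Obj\x01"  # Avro object container file magic bytes


def read_long(stream):
    """Decode one Avro long (zigzag, variable-length) from the stream."""
    shift, accum = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            return (accum >> 1) ^ -(accum & 1)
        shift += 7


def write_long(stream, value):
    """Encode one Avro long (zigzag, variable-length) to the stream."""
    value = (value << 1) ^ (value >> 63)
    while value & ~0x7F:
        stream.write(bytes([(value & 0x7F) | 0x80]))
        value >>= 7
    stream.write(bytes([value]))


def read_header(stream):
    """Return (metadata dict, sync marker); leaves the stream at the first block."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:  # a negative count is followed by the block's byte size
            read_long(stream)
            count = -count
        for _ in range(count):
            key = stream.read(read_long(stream)).decode("utf-8")
            meta[key] = stream.read(read_long(stream))
    return meta, stream.read(16)


def write_header(stream, meta, sync):
    """Write magic, the metadata map as a single map block, and the sync marker."""
    stream.write(MAGIC)
    if meta:
        write_long(stream, len(meta))
        for key, value in meta.items():
            raw_key = key.encode("utf-8")
            write_long(stream, len(raw_key))
            stream.write(raw_key)
            write_long(stream, len(value))
            stream.write(value)
    write_long(stream, 0)  # end of map
    stream.write(sync)


def rewrite_header(src, dst, fix_schema):
    """Copy src to dst, replacing only the schema stored in the header."""
    meta, sync = read_header(src)
    schema = json.loads(meta["avro.schema"].decode("utf-8"))
    meta["avro.schema"] = json.dumps(fix_schema(schema)).encode("utf-8")
    write_header(dst, meta, sync)
    # Data blocks are copied unchanged; no record is decoded or re-encoded.
    shutil.copyfileobj(src, dst, 16 * 1024 * 1024)
```

With binary file objects opened on the source and destination, this would be called as `rewrite_header(src, dst, fix_schema)`. Each file still has to be read and written once in full, but the cost should be close to a plain copy rather than a full decode and re-encode.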

Because the file count is large, the conversion is going to run for months, even with parallel processing. Is there an easier way to do this?

Dipan Saha
  • Do you know which step accounts for most of the processing time: downloading/uploading, unpacking, or scanning? This is an older read, but it might address some of the concepts: https://techblog.rtbhouse.com/2017/04/18/fast-avro/ – Pentium10 Jan 23 '23 at 22:18

0 Answers