What is the best way to upgrade Avro files (stored on GCS) with older schemas (containing "default":"null") to the newer format (with "default":null)?

Asked Jan 23 '23 at 19:57

Active Jan 23 '23 at 19:57

Viewed 47 times

We have a large number of Avro files on GCP (total storage size in the petabytes) with older schemas: the schema section of the file header contains "default":"null" for a few 'record' type columns. When we try to load these files into BQ, BigQuery is not able to interpret them. The solution appears to be converting "default":"null" to "default":null.
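For reference, the schema change itself can be sketched as below. This is a minimal sketch, not our production code: the function names `fix_null_defaults` and `_null_comes_first` are hypothetical, and it assumes the string default should only be replaced when the field's type is "null" or a union whose first branch is "null" (so a genuine string default of "null" on a string field is left alone).

```python
import json


def _null_comes_first(field_type):
    """True if the type is "null" or a union whose first branch is "null"."""
    if field_type == "null":
        return True
    return isinstance(field_type, list) and bool(field_type) and field_type[0] == "null"


def fix_null_defaults(node):
    """Return a copy of a parsed Avro schema with "default": "null" turned into null.

    Hypothetical helper: walks nested records, unions, arrays and maps.
    """
    if isinstance(node, list):
        return [fix_null_defaults(item) for item in node]
    if isinstance(node, dict):
        fixed = {key: fix_null_defaults(value) for key, value in node.items()}
        # Only touch the default when the field can actually hold null.
        if fixed.get("default") == "null" and _null_comes_first(fixed.get("type")):
            fixed["default"] = None
        return fixed
    return node


if __name__ == "__main__":
    old = json.loads(
        '{"type": "record", "name": "R", "fields": ['
        '{"name": "a", "type": ["null", "string"], "default": "null"}]}'
    )
    print(json.dumps(fix_null_defaults(old)))
```

The input is the schema parsed with `json.loads`, and the result can be serialised back with `json.dumps`, which writes Python `None` as JSON `null`.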

We have written a couple of custom Python scripts to convert the header to the newer format (using the avro and fastavro libraries), but they take a long time to process even a 1 GB file (25 minutes).
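One alternative we are considering is to rewrite only the file header and copy the data blocks byte for byte, instead of decoding and re-encoding every record. The sketch below is untested against our real files and rests on assumptions: that the files follow the standard Avro object container layout (magic bytes `Obj\x01`, a metadata map containing `avro.schema`, then a 16-byte sync marker), and that changing a field default does not alter how records are encoded, so the blocks can stay as they are. All function names are hypothetical, and `fix_schema` is any caller-supplied function that takes the parsed schema and returns the corrected one.

```python
import json
import shutil

MAGIC = b"Obj\x01"
SYNC_SIZE = 16


def read_long(stream):
    """Decode one Avro long (zigzag, variable-length) from a binary stream."""
    shift, accum = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def write_long(value):
    """Encode one Avro long (zigzag, variable-length) to bytes."""
    value = (value << 1) ^ (value >> 63)
    out = bytearray()
    while value & ~0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def read_header(stream):
    """Read magic, metadata map and sync marker; leave stream at the first block."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:
            read_long(stream)  # byte size of this map block, not needed here
            count = -count
        for _ in range(count):
            key = stream.read(read_long(stream)).decode("utf-8")
            meta[key] = stream.read(read_long(stream))
    return meta, stream.read(SYNC_SIZE)


def write_header(meta, sync):
    """Serialise magic, metadata map and the original sync marker."""
    out = bytearray(MAGIC)
    if meta:
        out += write_long(len(meta))
        for key, value in meta.items():
            raw_key = key.encode("utf-8")
            out += write_long(len(raw_key)) + raw_key
            out += write_long(len(value)) + value
    out += write_long(0)  # end of map
    return bytes(out) + sync


def rewrite_header(src, dst, fix_schema, chunk_size=8 * 1024 * 1024):
    """Write a corrected header to dst, then copy the data blocks unchanged."""
    meta, sync = read_header(src)
    schema = json.loads(meta["avro.schema"])
    meta["avro.schema"] = json.dumps(fix_schema(schema)).encode("utf-8")
    dst.write(write_header(meta, sync))
    shutil.copyfileobj(src, dst, chunk_size)
```

Here `src` and `dst` are binary file objects. The sync marker is kept as is because each data block ends with it. The whole object would still have to be written out again, so this would only remove the record decoding and encoding cost, not the I/O.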

Because the file count is large, the process is going to run for months, even with parallel processing. Is there an easier way to do this?

Dipan Saha

Do you know which part accounts for most of the time? Is it downloading/uploading, unpacking, or scanning? This is an older read, but it might address some of the concepts: https://techblog.rtbhouse.com/2017/04/18/fast-avro/ – Pentium10 Jan 23 '23 at 22:18

0 Answers