Would really appreciate it if someone could help out here. We have just started looking into GCP and need a robust, simple pattern for loading transactional data in XML format, published on Cloud Pub/Sub, into a date-partitioned BigQuery table for use in complex downstream batch processing orchestrated by Airflow.
Has anyone done this before?
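For reference, this is roughly the shape of ingestion I have in mind (just a sketch in Python; the project, subscription, table and field names are placeholders, and in practice this would probably run as a Dataflow/Beam job rather than a plain subscriber):

```python
# Rough sketch: pull XML messages from Pub/Sub, convert them to JSON strings,
# and stream them into a date-partitioned BigQuery table.
# All names (project, subscription, table) are placeholders.
import json

import xmltodict  # example XML-to-dict library; any converter would do
from google.cloud import bigquery, pubsub_v1

PROJECT = "my-project"                            # placeholder
SUBSCRIPTION = "transactions-sub"                 # placeholder
TABLE = "my-project.staging.transactions_raw"     # partitioned on DATE(publish_time)

bq = bigquery.Client(project=PROJECT)

def callback(message):
    # Schema drift in the XML just shows up as extra keys in the JSON string,
    # so nothing breaks on the ingestion side.
    payload = xmltodict.parse(message.data.decode("utf-8"))
    row = {
        "publish_time": message.publish_time.isoformat(),
        "payload_json": json.dumps(payload),
    }
    errors = bq.insert_rows_json(TABLE, [row])  # streaming insert
    if errors:
        print("BigQuery insert errors:", errors)
        message.nack()
    else:
        message.ack()

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
streaming_pull = subscriber.subscribe(sub_path, callback=callback)
streaming_pull.result()  # block and keep consuming; Ctrl+C to stop
```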
To allow for schema drift on the ingestion side, one option would be to convert the XML to JSON and store the JSON as a string, with a BQ view on top that uses JSON functions to extract fields for downstream processing. What are the pros/cons of this approach?
One pro in our case is that there are a lot of fields (300+) in the XML but only a subset is used initially; over time we need to be able to “turn on” new fields fast, which with this approach should just mean adding another extracted column to the view (rough sketch below).
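Roughly what I mean by the view idea (again only a sketch; dataset, table and JSON paths are made up):

```python
# Sketch: a date-partitioned staging table holding the JSON string, plus a view
# that extracts only the fields we need today. "Turning on" a new field later
# is just one more JSON_VALUE() line in the view, with no reload of data.
from google.cloud import bigquery

bq = bigquery.Client(project="my-project")  # placeholder project

bq.query("""
CREATE TABLE IF NOT EXISTS staging.transactions_raw (
  publish_time  TIMESTAMP,
  payload_json  STRING
)
PARTITION BY DATE(publish_time)
""").result()

bq.query("""
CREATE OR REPLACE VIEW staging.transactions AS
SELECT
  publish_time,
  -- JSON paths below are placeholders for whatever the real payload looks like
  JSON_VALUE(payload_json, '$.transaction.id')       AS transaction_id,
  JSON_VALUE(payload_json, '$.transaction.amount')   AS amount,
  JSON_VALUE(payload_json, '$.transaction.currency') AS currency
FROM staging.transactions_raw
""").result()
```

The downstream Airflow jobs would then only ever read from the view, so exposing one of the other 300 fields is just a view change.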
Maybe go one step further and store the raw XML in BQ, and use BQ SQL + a UDF to convert it to JSON?
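Something like this is what I had in mind for that variant, though I'm unsure how practical XML parsing inside a JS UDF really is; the regex below is only a crude stand-in, and a real version would presumably need a proper XML parsing library shipped to the UDF (table and tag names are again hypothetical):

```python
# Sketch: keep the raw XML in BigQuery and extract fields at query time with a
# JS UDF. The regex is a stand-in only; real parsing would need an XML library
# made available to the UDF, e.g. via OPTIONS(library=[...]) from GCS.
from google.cloud import bigquery

bq = bigquery.Client(project="my-project")  # placeholder

sql = r'''
CREATE TEMP FUNCTION xml_element(xml STRING, tag STRING)
RETURNS STRING
LANGUAGE js AS """
  // Crude: return the text content of the first <tag>...</tag> occurrence.
  var re = new RegExp('<' + tag + '>([^<]*)</' + tag + '>');
  var m = xml ? xml.match(re) : null;
  return m ? m[1] : null;
""";

SELECT
  publish_time,
  xml_element(payload_xml, 'TransactionId') AS transaction_id,
  xml_element(payload_xml, 'Amount')        AS amount
FROM staging.transactions_raw_xml             -- hypothetical table with the raw XML
WHERE DATE(publish_time) = DATE '2020-01-01'  -- prune to a single partition
'''

for row in bq.query(sql).result():
    print(dict(row))
```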
Any hints much appreciated, thanks!
/Mattias