Would really appreciate it if someone could help out here. We have just started looking into GCP and need a robust, simple pattern for loading transactional data in XML format, published on Cloud Pub/Sub, into a date-partitioned BigQuery table for use in complex downstream batch processing orchestrated by Airflow.
Has anyone done this before?
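For reference, this is roughly the shape of ingestion I have in mind (just a sketch in Python; the project, subscription, table and field names are placeholders, and in practice this would probably run as a Dataflow/Beam job rather than a plain subscriber):

```python
# Rough sketch: pull XML messages from Pub/Sub, convert them to JSON strings,
# and stream them into a date-partitioned BigQuery table.
# All names (project, subscription, table) are placeholders.
import json

import xmltodict  # example XML-to-dict library; any converter would do
from google.cloud import bigquery, pubsub_v1

PROJECT = "my-project"                            # placeholder
SUBSCRIPTION = "transactions-sub"                 # placeholder
TABLE = "my-project.staging.transactions_raw"     # partitioned on DATE(publish_time)

bq = bigquery.Client(project=PROJECT)

def callback(message):
    # Schema drift in the XML just shows up as extra keys in the JSON string,
    # so nothing breaks on the ingestion side.
    payload = xmltodict.parse(message.data.decode("utf-8"))
    row = {
        "publish_time": message.publish_time.isoformat(),
        "payload_json": json.dumps(payload),
    }
    errors = bq.insert_rows_json(TABLE, [row])  # streaming insert
    if errors:
        print("BigQuery insert errors:", errors)
        message.nack()
    else:
        message.ack()

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
streaming_pull = subscriber.subscribe(sub_path, callback=callback)
streaming_pull.result()  # block and keep consuming; Ctrl+C to stop
```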
To allow for schema drift on the ingestion side, one option would be to convert the XML to JSON and store the JSON as a string, with a BQ view on top that uses JSON functions to extract fields for downstream processing. What are the pros/cons of this approach?
One pro in our case is that there are a lot of fields (300+) in the XML but only a subset is used initially; over time we need to be able to “turn on” new fields fast, which with this approach should just mean adding another extracted column to the view (rough sketch below).
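Roughly what I mean by the view idea (again only a sketch; dataset, table and JSON paths are made up):

```python
# Sketch: a date-partitioned staging table holding the JSON string, plus a view
# that extracts only the fields we need today. "Turning on" a new field later
# is just one more JSON_VALUE() line in the view, with no reload of data.
from google.cloud import bigquery

bq = bigquery.Client(project="my-project")  # placeholder project

bq.query("""
CREATE TABLE IF NOT EXISTS staging.transactions_raw (
  publish_time  TIMESTAMP,
  payload_json  STRING
)
PARTITION BY DATE(publish_time)
""").result()

bq.query("""
CREATE OR REPLACE VIEW staging.transactions AS
SELECT
  publish_time,
  -- JSON paths below are placeholders for whatever the real payload looks like
  JSON_VALUE(payload_json, '$.transaction.id')       AS transaction_id,
  JSON_VALUE(payload_json, '$.transaction.amount')   AS amount,
  JSON_VALUE(payload_json, '$.transaction.currency') AS currency
FROM staging.transactions_raw
""").result()
```

The downstream Airflow jobs would then only ever read from the view, so exposing one of the other 300 fields is just a view change.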
Maybe go one step further and store the raw XML in BQ, and use BQ SQL + a UDF to convert it to JSON?
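Something like this is what I had in mind for that variant, though I'm unsure how practical XML parsing inside a JS UDF really is; the regex below is only a crude stand-in, and a real version would presumably need a proper XML parsing library shipped to the UDF (table and tag names are again hypothetical):

```python
# Sketch: keep the raw XML in BigQuery and extract fields at query time with a
# JS UDF. The regex is a stand-in only; real parsing would need an XML library
# made available to the UDF, e.g. via OPTIONS(library=[...]) from GCS.
from google.cloud import bigquery

bq = bigquery.Client(project="my-project")  # placeholder

sql = r'''
CREATE TEMP FUNCTION xml_element(xml STRING, tag STRING)
RETURNS STRING
LANGUAGE js AS """
  // Crude: return the text content of the first <tag>...</tag> occurrence.
  var re = new RegExp('<' + tag + '>([^<]*)</' + tag + '>');
  var m = xml ? xml.match(re) : null;
  return m ? m[1] : null;
""";

SELECT
  publish_time,
  xml_element(payload_xml, 'TransactionId') AS transaction_id,
  xml_element(payload_xml, 'Amount')        AS amount
FROM staging.transactions_raw_xml             -- hypothetical table with the raw XML
WHERE DATE(publish_time) = DATE '2020-01-01'  -- prune to a single partition
'''

for row in bq.query(sql).result():
    print(dict(row))
```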
Any hints much appreciated, thanks!
/Mattias