
I have a Google Cloud Storage bucket where a legacy system drops newline-delimited JSON (NEWLINE_DELIMITED_JSON) files that need to be loaded into BigQuery.

I wrote a Google Cloud Function that takes the JSON file and loads it into BigQuery. The function works fine with sample JSON files - the problem is that the legacy system generates JSON with a non-standard key:

{
  "id": 12345,
  "@address": "XXXXXX"
  ...
}

Of course the "@address" key throws everything off and the Cloud Function errors out ...

Is there any option to "ignore" the JSON fields that have non-standard keys? Or to provide a mapping and ignore any JSON field that is not in the map? I looked around to see if I could disable schema auto-detection and provide my own mapping, but the online documentation does not cover this situation.

I am contemplating the option of:

  • Load the file from GCS into a string variable in memory
  • Replace @address with address
  • Parse the newline-delimited JSON into a list of dictionaries
  • Use a BigQuery streaming insert to insert the rows into BQ

But I'm afraid this will take a lot longer, the file size may exceed the 2 GB maximum for Cloud Functions, I'd have to deal with Unicode when loading the file into a variable, etc. etc. etc.
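Roughly, this is what I have in mind - just a sketch, not tested, with made-up dataset/table names, and the exact client calls may differ depending on library versions:

import json

from google.cloud import bigquery, storage


def load_cleaned_file(bucket_name, blob_name):
    # Pull the whole newline-delimited JSON file into memory as text.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    contents = blob.download_as_string().decode('utf-8')

    # Rename the offending key, then parse each line into a dict.
    rows = [json.loads(line.replace('"@address"', '"address"'))
            for line in contents.splitlines() if line.strip()]

    # Stream the rows into BigQuery.
    client = bigquery.Client()
    table = client.get_table(client.dataset('my_dataset').table('my_table'))
    errors = client.insert_rows(table, rows)
    if errors:
        raise RuntimeError(errors)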

What other options do I have?

And no, I cannot modify the legacy system to rename the "@address" field :(

Thanks!

Neurus
  • What exactly is the error you're getting? Instead of kicking off the load job directly from the Cloud Function, it would be prudent to have the Cloud Function kick off a Dataflow pipeline (template) and have that do the load. – Graham Polley Jan 12 '18 at 07:56
  • hey @GrahamPolley sorry, yes, the error is the one you mention in the answer. Let me read the answer and I'll get back to you - thanks! – Neurus Jan 12 '18 at 19:27

1 Answer


I'm going to assume the error that you are getting is something like this:

Errors: query: Invalid field name "@address". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.

This is an error message on the BigQuery side, because columns/fields in BigQuery have naming restrictions. So, you're going to have to clean your file(s) before loading them into BigQuery.

Here's one way of doing it, which is completely serverless:

  1. Create a Cloud Function to trigger on new files arriving in the bucket. You've already done this part by the sounds of things.
  2. Create a templated Cloud Dataflow pipeline that is triggered by the Cloud Function when a new file arrives. It simply passes the name of the file to process to the pipeline.
  3. In said Cloud Dataflow pipeline, read the JSON file in a ParDo and, using a JSON parsing library (e.g. Jackson if you are using Java), read each object and get rid of the "@" before creating your output TableRow object (see the sketch after this list).
  4. Write results to BigQuery. Under the hood, this will actually invoke a BigQuery load job.
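If you end up writing the pipeline in Python rather than Java, the cleaning step (3) and the write (4) could look roughly like the Beam sketch below. The input path, table and schema are placeholders, a simple Map is used instead of a full ParDo for brevity, and in a real template the file name would arrive as a runtime parameter:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def clean_keys(record):
    # Strip the leading "@" so field names satisfy BigQuery's naming rules.
    return {key.lstrip('@'): value for key, value in record.items()}


def run(argv=None):
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/incoming/file.json')
         | 'Parse' >> beam.Map(json.loads)
         | 'CleanKeys' >> beam.Map(clean_keys)
         | 'Write' >> beam.io.WriteToBigQuery(
             'my-project:my_dataset.my_table',
             schema='id:INTEGER,address:STRING',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == '__main__':
    run()

In batch mode that final write is executed as a BigQuery load job under the hood, as mentioned in step 4.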

To sum up, you'll need the following in the conga line:

File > GCS > Cloud Function > Dataflow (template) > BigQuery
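The Cloud Function link in that chain could launch the template along these lines - a sketch assuming a Python function with a GCS trigger (yours may well be Node.js); the project ID, template path and parameter name are placeholders:

from googleapiclient.discovery import build


def launch_pipeline(event, context):
    # Background GCS trigger: "event" carries the bucket and object name.
    input_file = 'gs://{}/{}'.format(event['bucket'], event['name'])

    # Launch the staged Dataflow template via the Dataflow REST API.
    dataflow = build('dataflow', 'v1b3')
    dataflow.projects().templates().launch(
        projectId='my-project',
        gcsPath='gs://my-bucket/templates/clean-and-load',
        body={
            'jobName': 'clean-and-load',
            'parameters': {'input': input_file},
        },
    ).execute()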

The advantages of this:

  1. Event driven
  2. Scalable
  3. Serverless/no-ops
  4. You get monitoring and alerting out of the box with Stackdriver
  5. Minimal code


Graham Polley
  • Thanks Graham - I'm building the pipeline as suggested, although I'm writing the Dataflow job in Python and having issues with the templated jobs: https://stackoverflow.com/q/48308693/399302 Do you happen to know DF for Python? – Neurus Jan 23 '18 at 20:32