
When I export query results as JSON through BigQuery, any field whose value is NULL disappears from the downloaded output.

An example of the exported query:

EXPORT DATA OPTIONS(
  uri='gs://analytics-export/_*',
  format='JSON',
  overwrite=true) AS
SELECT NULL AS field1

Actual result: {}

Expected result: {"field1": null}

How can I force the export to include the null value, as shown in the expected result?

Oded .S

1 Answer


As a quick workaround you can use either of:

SELECT TO_JSON_STRING(NULL) AS field1
SELECT 'null' AS field1
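One caveat about these workarounds (my understanding of TO_JSON_STRING's behavior): TO_JSON_STRING(NULL) returns the string 'null', so the exported field contains a JSON *string* rather than a true JSON null. In Python terms the difference looks like this:

```python
import json

# A true null versus the string 'null' that the workarounds produce
true_null = json.dumps({"field1": None})
string_null = json.dumps({"field1": "null"})

print(true_null)    # {"field1": null}
print(string_null)  # {"field1": "null"}
```

Whether that distinction matters depends on what consumes the exported file.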

The EXPORT DATA documentation has no reference to an option that includes null values in the output, so I think you can go to the feature request page and file a request for it. There are also similar observations on other projects indicating it is not supported yet; see details here.

There are several workarounds for this; let me show you two options:

Option 1: Query directly from Python using the BigQuery client library

from google.cloud import bigquery
import json

client = bigquery.Client()

query = "SELECT NULL AS field1, NULL AS field2"
query_job = client.query(query)

# Build one dict per row; json.dumps serializes None as null,
# so null fields are kept instead of dropped.
rows = []
for row in query_job:
    rows.append({'field1': row[0], 'field2': row[1]})

# Write newline-delimited JSON, matching the EXPORT DATA layout.
with open('test.json', 'w') as file:
    for row in rows:
        file.write(json.dumps(row) + '\n')
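The reason Option 1 works is that the client library returns SQL NULLs as Python None, and json.dumps always writes the key with a null value rather than dropping it. A minimal sketch of just the serialization step, with hard-coded tuples standing in for query results (assumption: no BigQuery connection here):

```python
import json

# Hard-coded rows standing in for BigQuery query results.
rows = [(None, None), (1, None)]

lines = []
for field1, field2 in rows:
    # None survives serialization as a JSON null instead of being dropped.
    lines.append(json.dumps({'field1': field1, 'field2': field2}))

print('\n'.join(lines))
# {"field1": null, "field2": null}
# {"field1": 1, "field2": null}
```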

Option 2: Use an Apache Beam (Dataflow) pipeline in Python with BigQuery to produce the desired output

import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions



def add_null_field(row, field):
  # Re-insert a field that was dropped from the row, defaulting it to
  # None (serialized as null by json.dumps). Passing field='skip'
  # leaves the row untouched.
  if field != 'skip':
    row.update({field: row.get(field, None)})
  return row


def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output',
        dest='output',
        required=True,
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

    with beam.Pipeline(options=pipeline_options) as p:

        (p
        | beam.io.Read(beam.io.BigQuerySource(query='SELECT null as field1, null as field2'))
        | beam.Map(add_null_field, field='skip')
        | beam.Map(json.dumps)
        | beam.io.Write(beam.io.WriteToText(known_args.output, file_name_suffix='.json')))

if __name__ == '__main__': 
  run()
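To see what add_null_field does outside the pipeline, here is a small standalone check (it repeats the function above verbatim so it can run on its own):

```python
def add_null_field(row, field):
    # Same logic as in the pipeline: re-insert a missing field as None.
    if field != 'skip':
        row.update({field: row.get(field, None)})
    return row

# A row where 'field2' was dropped entirely:
row = {'field1': 1}
print(add_null_field(dict(row), 'field2'))  # {'field1': 1, 'field2': None}
print(add_null_field(dict(row), 'skip'))    # {'field1': 1}
```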

To run it:

python -m export --output gs://my_bucket_id/output/ \
                 --runner DataflowRunner \
                 --project my_project_id \
                 --region my_region \
                 --temp_location gs://my_bucket_id/tmp/

Note: just replace my_project_id, my_bucket_id and my_region with the appropriate values, then look in your Cloud Storage bucket for the output file.

Both options will produce the output you are looking for:

{"field1": null, "field2": null}

Please let me know if this helps and gives you the result you want to achieve.

Betjens
  • Thanks, I have changed the output schema and this problem is no longer happening. We are not using Appache Beam in our project, although it is a nice option though – Oded .S Nov 19 '21 at 07:29