How to extract a value as a column from JSON with multiple key-value lists using (a materialized view compatible) SQL?

Question

There is a table with measurements. There is a column in this table with the name measurement (JSON type). It contains lists of a named parameter values.

A sample table with one key-value list called parameters can be defined as follow:

select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}') measurement

Same in the table form:

id	measurement
1	{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}
2	{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}
3	{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}

Now, I want to extract some values from the list to columns. For example, if I want parameters aaa and bbb, I would expect the output like:

id	aaa	bbb
1	10	20
2	11	22
3	111	222

CASE-WHEN-GROUP-BY method

I can achieve this using 4 sub-queries. It starts already getting complex, but still bearable:

with measurements AS (
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}') measurement
),
parameters AS (select id, JSON_QUERY_ARRAY(measurements.measurement.parameters) measurements_list from measurements),
param_values as (select id, JSON_VALUE(ml.name) name, JSON_VALUE(ml.value) value from parameters, parameters.measurements_list ml),

trimmed_values as (select id, case when name="aaa" then value else null end as aaa, case when name="bbb" then value else null end as bbb
from param_values where name in ("aaa", "bbb"))

select id, max(aaa) aaa, max(bbb) bbb from trimmed_values group by id

JSONPath method

I can also use full-featured JSONPath function as suggested by Mikhail. Then things start looking more manageable:

select id,
  bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="aaa")].value') aaa,
  bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="bbb")].value') bbb
from `sap-clm-analytics-dev.ag_experiment.measurements`

(It may be less efficient then the CASE-WHEN-GROUP-BY method because of the external UDF call, but let's focus on maintainability for now).

Adding another list of values

Now I add another list of key-value pairs named colors:

select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "A"}, {"name": "yellow", "value": "B"}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AA"}, {"name": "yellow", "value": "BB"}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AAA"}, {"name": "yellow", "value": "BBB"}]}') measurement

Let's pick values for green from the list of colors. Then the output will be:

id	aaa	bbb	green
1	10	20	A
2	11	22	AA
3	111	222	AAA

The JSONPath solution from above can be trivially extended to cover this case:

select id,
  bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="aaa")].value') aaa,
  bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="bbb")].value') bbb,
  bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.colors), '$.[?(@.name=="green")].value') bbb
from measurements

With the CASE-WHEN approach things starts getting tricky. The below query already gets complex and is simply wrong:

with measurements AS (
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "A"}, {"name": "yellow", "value": "B"}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AA"}, {"name": "yellow", "value": "BB"}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AAA"}, {"name": "yellow", "value": "BBB"}]}') measurement),

parameters_colors AS (
select id, JSON_QUERY_ARRAY(measurements.measurement.parameters) parameters_list, JSON_QUERY_ARRAY(measurements.measurement.colors) colors_list from measurements),

param_color_values AS (select id, JSON_VALUE(parameters_list.name) param_name, JSON_VALUE(parameters_list.value) param_value, JSON_VALUE(colors_list.name) color_name, JSON_VALUE(colors_list.value) color_value  from parameters_colors, parameters_colors.parameters_list, parameters_colors.colors_list),

trimmed_values AS (select id,
    case when param_name="aaa" then param_value else null end as aaa,
    case when param_name="bbb" then param_value else null end as bbb,
    case when color_name="green" then color_value else null end as green,
from param_color_values where param_name in ("aaa", "bbb") and color_name = "green")

select id, max(aaa) aaaa, max(bbb) bbb, max(green) green from trimmed_values group by 1

Wrong result:

id	aaa	bbb	green
1	10	20	A
2	10	20	AA
3	10	20	AAA

The cartesian product in param_color_values is fine, but trimmed_values incorrectly fill permutations with nulls. Apparently the level of dependency is needed for "green" values.

It would be apparently possible to fix my example, but it probably won't be maintainable after another list of parameters. So, I want to phrase my question differently.

Question

What would be a maintainable way to extract multiple values from such data structures in SQL?

Materialized view

Ideally, I'd like to persist such query as a BigQuery materialized view. The original data object is huge, so I want to create a stage in the data pipeline, which persists a curated subset of it, differently clustered. I want that the BigQuery manages the refreshes of this object. Materialized view has a limited set of functions. For examples, UDFs (like CUSTOM_JSON_PATH) are not supported.

My current state

I tend to drop the idea of using the materialized view in favor of the maintainability of the UDF/JSONPath method and organize the refresh of the extracted dataset myself using scheduled queries.

Do I oversee any trivial pure SQL solution, which is optionally materialized-view compatible and easy to extend to more complex cases?

score 2 · Answer 1 · answered Aug 20 '22 at 18:54

I tend to drop the idea of using the materialized view in favor of the maintainability of the UDF/JSONPath method and organize the refresh of the extracted dataset myself using scheduled queries.

Consider below approach (not compatible with materialized view)

create temp function get_keys(input string) returns array<string> language js as """
  return Object.keys(JSON.parse(input));
  """;
create temp function get_values(input string) returns array<string> language js as """ 
  return Object.values(JSON.parse(input));
  """;
create temp function get_leaves(input string) returns string language js as '''
  function flattenObj(obj, parent = '', res = {}){
    for(let key in obj){
        let propName = parent ? parent + '.' + key : key;
        if(typeof obj[key] == 'object'){
            flattenObj(obj[key], propName, res);
        } else {
            res[propName] = obj[key];
        }
    }
    return JSON.stringify(res);
  }
  return flattenObj(JSON.parse(input));
  ''';
with temp as (
  select id, val, --key, val, --leaves
    if(ends_with(key, '.name'), 'name', 'value') type,
    regexp_replace(key, r'.name$|.value$', '') key
  from your_table, unnest([struct(get_leaves(json_extract(to_json_string(measurement), '$')) as leaves)]),
  unnest(get_keys(leaves)) key with offset
  join unnest(get_values(leaves)) val with offset using(offset)
)
select * from (
  select * except(key) 
  from temp 
  pivot (any_value(val) for type in ('name', 'value'))
)
pivot (any_value(value) for name in ('aaa', 'bbb', 'ccc', 'green', 'yellow') )

if applied to sample data in your question - output is

In case if keys are not known in advance or too many to manually manage - you can use below dynamic version

create temp function get_keys(input string) returns array<string> language js as """
  return Object.keys(JSON.parse(input));
  """;
create temp function get_values(input string) returns array<string> language js as """ 
  return Object.values(JSON.parse(input));
  """;
create temp function get_leaves(input string) returns string language js as '''
  function flattenObj(obj, parent = '', res = {}){
    for(let key in obj){
        let propName = parent ? parent + '.' + key : key;
        if(typeof obj[key] == 'object'){
            flattenObj(obj[key], propName, res);
        } else {
            res[propName] = obj[key];
        }
    }
    return JSON.stringify(res);
  }
  return flattenObj(JSON.parse(input));
  ''';
create temp table temp as (
  select * except(key) from (
    select id, val, 
      if(ends_with(key, '.name'), 'name', 'value') type,
      regexp_replace(key, r'.name$|.value$', '') key
    from your_table, unnest([struct(get_leaves(json_extract(to_json_string(measurement), '$')) as leaves)]),
    unnest(get_keys(leaves)) key with offset
    join unnest(get_values(leaves)) val with offset using(offset)
  )
  pivot (any_value(val) for type in ('name', 'value'))
);
execute immediate (select '''
select * from temp
pivot (any_value(value) for name in ("''' || string_agg(distinct name, '","') || '"))'
from temp
);

Thank you Mikhail very much for the quick response! Your contributions to BigQuery community on StackOverflow are amazing and I'm happily using JSONPath approach from your post I linked. — Anton Golubev, Aug 20 '22 at 20:05
To your suggestion. Key-value lists can be on any level of JSON hierarchy, even embedded one into another. Would it be in this circumstances still beneficial to use a custom-made getter you suggest (maintenance, adaptation to new use-cases), given the existing _JSONPath_ syntax covers all kind use-cases and foreseeable future use-cases? — Anton Golubev, Aug 20 '22 at 20:05
And generally, can I assume from your answer that you also prefer the UDF solution over the pure SQL solution for this case as it is cleaner and easier maintainable? — Anton Golubev, Aug 20 '22 at 20:05
I appreciate your answer! This a nice original method of tackling this problem: use JS UDF for easy extraction of key-value lists from original JSON and SQL _PIVOT_ for transposing narrow list into wide list. I learned something, thank you! — Anton Golubev, Aug 20 '22 at 20:23