There is a table with measurements. There is a column in this table with the name measurement
(JSON type). It contains lists of a named parameter values.
A sample table with one key-value list called parameters can be defined as follow:
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}') measurement
Same in the table form:
id | measurement |
---|---|
1 | {"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]} |
2 | {"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]} |
3 | {"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]} |
Now, I want to extract some values from the list to columns. For example, if I want parameters aaa
and bbb
, I would expect the output like:
id | aaa | bbb |
---|---|---|
1 | 10 | 20 |
2 | 11 | 22 |
3 | 111 | 222 |
CASE-WHEN-GROUP-BY method
I can achieve this using 4 sub-queries. It starts already getting complex, but still bearable:
with measurements AS (
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}') measurement
),
parameters AS (select id, JSON_QUERY_ARRAY(measurements.measurement.parameters) measurements_list from measurements),
param_values as (select id, JSON_VALUE(ml.name) name, JSON_VALUE(ml.value) value from parameters, parameters.measurements_list ml),
trimmed_values as (select id, case when name="aaa" then value else null end as aaa, case when name="bbb" then value else null end as bbb
from param_values where name in ("aaa", "bbb"))
select id, max(aaa) aaa, max(bbb) bbb from trimmed_values group by id
JSONPath method
I can also use full-featured JSONPath
function as suggested by Mikhail. Then things start looking more manageable:
select id,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="aaa")].value') aaa,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="bbb")].value') bbb
from `sap-clm-analytics-dev.ag_experiment.measurements`
(It may be less efficient then the CASE-WHEN-GROUP-BY method because of the external UDF call, but let's focus on maintainability for now).
Adding another list of values
Now I add another list of key-value pairs named colors
:
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "A"}, {"name": "yellow", "value": "B"}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AA"}, {"name": "yellow", "value": "BB"}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AAA"}, {"name": "yellow", "value": "BBB"}]}') measurement
Let's pick values for green
from the list of colors. Then the output will be:
id | aaa | bbb | green |
---|---|---|---|
1 | 10 | 20 | A |
2 | 11 | 22 | AA |
3 | 111 | 222 | AAA |
The JSONPath solution from above can be trivially extended to cover this case:
select id,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="aaa")].value') aaa,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="bbb")].value') bbb,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.colors), '$.[?(@.name=="green")].value') bbb
from measurements
With the CASE-WHEN approach things starts getting tricky. The below query already gets complex and is simply wrong:
with measurements AS (
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "A"}, {"name": "yellow", "value": "B"}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AA"}, {"name": "yellow", "value": "BB"}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AAA"}, {"name": "yellow", "value": "BBB"}]}') measurement),
parameters_colors AS (
select id, JSON_QUERY_ARRAY(measurements.measurement.parameters) parameters_list, JSON_QUERY_ARRAY(measurements.measurement.colors) colors_list from measurements),
param_color_values AS (select id, JSON_VALUE(parameters_list.name) param_name, JSON_VALUE(parameters_list.value) param_value, JSON_VALUE(colors_list.name) color_name, JSON_VALUE(colors_list.value) color_value from parameters_colors, parameters_colors.parameters_list, parameters_colors.colors_list),
trimmed_values AS (select id,
case when param_name="aaa" then param_value else null end as aaa,
case when param_name="bbb" then param_value else null end as bbb,
case when color_name="green" then color_value else null end as green,
from param_color_values where param_name in ("aaa", "bbb") and color_name = "green")
select id, max(aaa) aaaa, max(bbb) bbb, max(green) green from trimmed_values group by 1
Wrong result:
id | aaa | bbb | green |
---|---|---|---|
1 | 10 | 20 | A |
2 | 10 | 20 | AA |
3 | 10 | 20 | AAA |
The cartesian product in param_color_values
is fine, but trimmed_values
incorrectly fill permutations with nulls. Apparently the level of dependency is needed for "green" values.
It would be apparently possible to fix my example, but it probably won't be maintainable after another list of parameters. So, I want to phrase my question differently.
Question
What would be a maintainable way to extract multiple values from such data structures in SQL?
Materialized view
Ideally, I'd like to persist such query as a BigQuery materialized view. The original data object is huge, so I want to create a stage in the data pipeline, which persists a curated subset of it, differently clustered. I want that the BigQuery manages the refreshes of this object. Materialized view has a limited set of functions. For examples, UDFs (like CUSTOM_JSON_PATH) are not supported.
My current state
I tend to drop the idea of using the materialized view in favor of the maintainability of the UDF/JSONPath method and organize the refresh of the extracted dataset myself using scheduled queries.
Do I oversee any trivial pure SQL solution, which is optionally materialized-view compatible and easy to extend to more complex cases?