1

My ELT tools imports my data in bigquery and generates/extends automatically the schema for dynamic nested keys (in the schema below, under properties)

It looks like this

enter image description here

How can I get the list of nested keys of a repeated record ? so for example I can group by properties when those items have said property non-null ?

I have tried

    select column_name
    from my_schema.INFORMATION_SCHEMA.COLUMNS
    where
        table_name = 'my_table
        

But it will only list first level keys

From the picture above, I want, as a first step, a SQL query that returns

message
user_id
seeker 
liker_id 
rateable_id
rateable_type
from_organization
likeable_type
company
existing_attempt 
...

My real goal through, is to group/count my data based on a non-null value of a 2nd level nested properties properties.filters.[filter_type]

The schema may evolve when our application adds more filters, so this need to be dynamically generated, I can't just hard-code the list of nested keys.

Note: this is very similar to this question How to extract all the keys in a JSON object with BigQuery but in my case my data is already in a shcema and it's not a JSON object

EDIT:

Suppose I have a list of such records with nested properties, how do I write a SQL query that adds a field "enabled_filters" which aggregates, for each item, the list of properties for wihch said property is not null ?

Example input (properties.x are dynamic and not known by the programmer)

search_id properties.filters.school properties.filters.type
1 MIT master
2 Princetown null
3 null master

Example output

search_id enabled_filters
1 ["school", "type"]
2 ["school"]
3 ["type"]
Cyril Duchon-Doris
  • 12,964
  • 9
  • 77
  • 164
  • You can use `TO_JSON_STRING` to convert your struct to a string. Then apply the UDF in the link you provided in your question. `Select jsonObjectKeys(TO_JSON_STRING(properties))` – Samuel Nov 08 '22 at 09:59
  • @Samuel this is most likely not very efficient right ? I have also extended my question to add my end goal, which it to get the list of "enabled properties" on each record (and actually group/count my requests by "enabled filters" so I can see what filters my users use in the searchbar – Cyril Duchon-Doris Nov 08 '22 at 10:06
  • Ok I have realized that I can most likely find what I need when displaying the data_type column in INFORMATION_SCHEMA.tables. But then I need to "parse" the data_type definition to be able to extract that which I need. It seems a bit tricky especially when considering deeply nested keys/arrays – Cyril Duchon-Doris Nov 08 '22 at 12:53
  • 1
    Have you looked at COLUMN_FIELD_PATHS? select field_path from `my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` where table_name = '' https://cloud.google.com/bigquery/docs/information-schema-column-field-paths
    – Nagesh Susarla Nov 09 '22 at 20:40
  • Awesome thank you @NageshSusarla this is exactly what I've been looking for ! If you make your comment as an answer I'll upvote and accept it. I just need to apply a string regex to filter the output for strings matching properties.filters.xx (and exclude nested properties.filters.xx.(value|__st) generated by BigQuery) and it should work as desired – Cyril Duchon-Doris Nov 10 '22 at 14:28
  • Added the comment as an answer – Nagesh Susarla Nov 16 '22 at 21:42

2 Answers2

2

The field properties is not nested by array only by structures. Then a UDF in JavaScript to parse thise field should work fast enough.

CREATE TEMP FUNCTION jsonObjectKeys(input STRING, shownull BOOL,fullname Bool)
RETURNS Array<String>
LANGUAGE js AS """
function test(input,old){
  var out=[]
  for(let x in input){
    let te=input[x];
    out=out.concat(te==null ? (shownull?[x+'==null']:[]) : typeof te=='object' ? test(te,old+x+'.') : [fullname ? old+x : x]  );
  }
  return  out;
  Object.keys(JSON.parse(input));

}
return test(JSON.parse(input),"");
""";
with tbl as (select struct(1 as alpha,struct(2 as x, 3 as y,[1,2,3] as z ) as B) A from unnest(generate_array(1,10*1))
 union all select struct(null,struct(null,1,[999])) ) 
select *,
TO_JSON_STRING (A ) as string_output,
jsonObjectKeys(TO_JSON_STRING (A),true,false) as output1,
jsonObjectKeys(TO_JSON_STRING (A),false,true) as output2,
concat('["', array_to_string(jsonObjectKeys(TO_JSON_STRING (A),false,true),'","' ) ,'"]') as output_sring,
jsonObjectKeys(TO_JSON_STRING (A.B),false,true) as outpu

 from tbl
Samuel
  • 2,923
  • 1
  • 4
  • 19
  • As I tested it, one million rows take about 50 seconds. – Samuel Nov 08 '22 at 13:06
  • Sorry for the previous comment, there was a problem with my query and it took less time after I fixed it. However I cannot get my nested filters to appear in your "output string" as shown in this screenshot : https://ibb.co/zVJQRjW – Cyril Duchon-Doris Nov 08 '22 at 13:19
  • And when I use "properties.filters" in your function it gives the following : https://ibb.co/5KH0Ytg – Cyril Duchon-Doris Nov 08 '22 at 13:21
  • The column name is `properties.filters.sectors.value` this is always giving the last item only. I added an option to include the full name. – Samuel Nov 08 '22 at 19:46
2

Have you looked at COLUMN_FIELD_PATHS? It should give you the paths for all columns.

select field_path from my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS where table_name = '<table>'

[https://cloud.google.com/bigquery/docs/information-schema-column-field-paths]

Nagesh Susarla
  • 1,660
  • 10
  • 10