
Loading repeated fields in GBQ by importing a JSON file

By importing a JSON file with repeated records into BigQuery, you can create a table with nested, repeated fields.

For example, for the schema:

[
  {"type": "STRING", "name": "item"},
  {"type": "RECORD", "name": "click", "mode": "REPEATED", "fields": [
    {"type": "TIMESTAMP", "name": "click_time"},
    {"type": "STRING", "name": "userid"}
  ]}
]

you can load a JSON file of clicks on items, with clicks repeated for each item. The resulting table would have the fields item, click.click_time, and click.userid.
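For reference, a newline-delimited JSON file matching that schema could look like the sample below (one record per line; the values are purely illustrative, reusing the data from the answers further down):

{"item": "a1", "click": [{"click_time": "2016-03-03 19:52:23 UTC", "userid": "u1"}, {"click_time": "2016-03-03 19:52:23 UTC", "userid": "u2"}, {"click_time": "2016-03-03 19:52:23 UTC", "userid": "u3"}, {"click_time": "2016-03-03 19:52:23 UTC", "userid": "u4"}]}
{"item": "a2", "click": [{"click_time": "2016-03-03 19:52:23 UTC", "userid": "u1"}, {"click_time": "2016-03-03 19:52:23 UTC", "userid": "u2"}]}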

My question

Say you have a CSV file that has flattened the above JSON item-clicks, with one row per click, so the item value is repeated across rows. Can you load this into GBQ and then convert it, with a GBQ query, into the equivalent table you would have had if you had loaded the JSON file with repeated fields?
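Concretely, such a flattened CSV could look like this (same illustrative values as the JSON sample above, one row per click with the item repeated):

item,click_time,userid
a1,2016-03-03 19:52:23 UTC,u1
a1,2016-03-03 19:52:23 UTC,u2
a1,2016-03-03 19:52:23 UTC,u3
a1,2016-03-03 19:52:23 UTC,u4
a2,2016-03-03 19:52:23 UTC,u1
a2,2016-03-03 19:52:23 UTC,u2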

The table resulting from the GBQ query on the imported CSV table should have item, click.click_time, and click.userid as fields.

  • Can you provide a sample of the file you want to load? – Felipe Hoffa Mar 03 '16 at 20:10
  • Mikhail beat me to it; see below for a sample file and a great answer. I was able to implement Mikhail's solution, as well as schemas involving multiple GROUP BY fields, like item, item_category1, item_category2, having associated clicks.click_time, clicks.userid (sketched after these comments). The UDF workaround Mikhail illustrates is more flexible than trying to use NEST, though as he mentions in another post it may lead to a higher billing level. – rmg Mar 04 '16 at 13:42
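As a follow-up to rmg's comment, the multi-field variant mentioned there can be sketched with the Standard SQL approach from the first answer below; FlatTable, item_category1, and item_category2 are hypothetical names, not from the original thread:

-- Sketch only: group by several item-level fields and collect the clicks per group.
SELECT
  item, item_category1, item_category2,
  ARRAY_AGG(STRUCT(click_time, userid)) AS clicks
FROM FlatTable
GROUP BY item, item_category1, item_category2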

2 Answers


With the introduction of BigQuery Standard SQL, there is now an easy way to deal with records.
Try the query below; don't forget to uncheck the Use Legacy SQL checkbox under Show Options.

WITH YourTable AS (
  -- sample flattened rows, as they would come from the imported CSV
  SELECT 'a1' AS item, '2016-03-03 19:52:23 UTC' AS click_time, 'u1' AS userid UNION ALL
  SELECT 'a1' AS item, '2016-03-03 19:52:23 UTC' AS click_time, 'u2' AS userid UNION ALL
  SELECT 'a1' AS item, '2016-03-03 19:52:23 UTC' AS click_time, 'u3' AS userid UNION ALL
  SELECT 'a1' AS item, '2016-03-03 19:52:23 UTC' AS click_time, 'u4' AS userid UNION ALL
  SELECT 'a2' AS item, '2016-03-03 19:52:23 UTC' AS click_time, 'u1' AS userid UNION ALL
  SELECT 'a2' AS item, '2016-03-03 19:52:23 UTC' AS click_time, 'u2' AS userid
)
-- one output row per item, with the clicks collapsed into a repeated record
SELECT item, ARRAY_AGG(STRUCT(click_time, userid)) AS clicks
FROM YourTable
GROUP BY item
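A usage note, not part of the original answer: once the nested result is saved somewhere (NestedTable below is just a placeholder), Standard SQL can flatten it back out with UNNEST:

-- Sketch: expand the repeated clicks record back into one row per click.
SELECT item, c.click_time, c.userid
FROM NestedTable, UNNEST(clicks) AS c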
  • Nice update. Do you know any links to documentation of the BigQuery Standard SQL grammar and features? – rmg May 11 '16 at 10:16

Assume you have flattened data in your table:

item  click_time               userid
a1    2016-03-03 19:52:23 UTC  u1
a1    2016-03-03 19:52:23 UTC  u2
a1    2016-03-03 19:52:23 UTC  u3
a1    2016-03-03 19:52:23 UTC  u4
a2    2016-03-03 19:52:23 UTC  u1
a2    2016-03-03 19:52:23 UTC  u2

The GBQ query below does what you ask for.
Please note: you need to write the result to a destination table with 'Allow Large Results' enabled and 'Flatten Results' unchecked.

SELECT *
FROM JS(
  ( // input table
    SELECT item, NEST(CONCAT(STRING(click_time), ',', STRING(userid))) AS clicks
    FROM YourTable
    GROUP BY item
  ),
  item, clicks, // input columns
  "[ // output schema
    {'name': 'item', 'type': 'STRING'},
    {'name': 'clicks', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'click_time', 'type': 'STRING'},
       {'name': 'userid', 'type': 'STRING'}
     ]
    }
  ]",
  "function(row, emit) { // function: rebuild repeated click records from the nested strings
    var c = [];
    for (var i = 0; i < row.clicks.length; i++) {
      var x = row.clicks[i].split(',');
      var t = {click_time: x[0],
               userid: x[1]};
      c.push(t);
    }
    emit({item: row.item, clicks: c});
  }"
)

The result is the expected nested table: one row per item, with clicks.click_time and clicks.userid as repeated fields.

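As a quick sanity check, not part of the original answer: once the nested output is written to a destination table (dataset.item_clicks is a placeholder name), a Standard SQL query like this should return one row per item with the expected number of repeated click records:

-- Sketch: verify the nested structure by counting repeated click records per item.
SELECT item, ARRAY_LENGTH(clicks) AS num_clicks
FROM `dataset.item_clicks`
ORDER BY item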
