1

I would like to have a table with a JSON column. This JSON column can include arbitrary documents. I would like to merge these documents according to their timestamp which is available in another column. Is there a way to merge these JSON documents by their timestamp?

Here is an example:

at t3 time {a:"1", b:"2"}
at t2 time {b:"4"}
at t1 time {a:"4", c:"5"}

I want to create {a:"1", b:"2", c:"5"} as output. Is it possible to this in BigQuery?

Cemo
  • 5,370
  • 10
  • 50
  • 82

3 Answers3

3

Below is for BigQuery Standard SQL

#standardSQL
SELECT STRING_AGG(y, ', ' ORDER BY y) json
FROM (
  SELECT STRING_AGG(TRIM(x) ORDER BY t DESC LIMIT 1) y
  FROM `project.dataset.table`, 
    UNNEST(SPLIT(REGEXP_REPLACE(json, r'{|}', ''))) x
  GROUP BY TRIM(SPLIT(x, ':')[OFFSET(0)]) 
)   

Note: above solution generic and does not require knowing in advance attribute names (like a, b, etc.) rather it parse and extract whatever it will find. Obviously it relies on assumption of simple jsons like in your examples

You can test, play with above using sample data from your question as in example below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '{"a":"1", "b":"2"}' json, 3 t UNION ALL
  SELECT '{"b":"4"}', 2 UNION ALL 
  SELECT '{"a":"4", "c":"5"}', 1  
)
SELECT STRING_AGG(y, ', ' ORDER BY y) json
FROM (
  SELECT STRING_AGG(TRIM(x) ORDER BY t DESC LIMIT 1) y
  FROM `project.dataset.table`, 
    UNNEST(SPLIT(REGEXP_REPLACE(json, r'{|}', ''))) x
  GROUP BY TRIM(SPLIT(x, ':')[OFFSET(0)]) 
)

with result

Row json     
1   "a":"1", "b":"2", "c":"5"     

Because (as I mentioned) it is generic enough - you can add rows with more attributes w/o changing code like in below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '{"a":"1", "b":"2"}' json, 3 t UNION ALL
  SELECT '{"b":"4"}', 2 UNION ALL 
  SELECT '{"a":"4", "c":"5"}', 1 UNION ALL
  SELECT '{"abc":"1", "xyz":"2"}', 3 UNION ALL
  SELECT '{"abc":"3", "vwy":"4"}', 3  
)
SELECT STRING_AGG(y, ', ' ORDER BY y) json
FROM (
  SELECT STRING_AGG(TRIM(x) ORDER BY t DESC LIMIT 1) y
  FROM `project.dataset.table`, 
    UNNEST(SPLIT(REGEXP_REPLACE(json, r'{|}', ''))) x
  GROUP BY TRIM(SPLIT(x, ':')[OFFSET(0)]) 
)

with result

Row json     
1   "a":"1", "abc":"1", "b":"2", "c":"5", "vwy":"4", "xyz":"2"   
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
2

Here is a possible solution using BigQuery standard SQL functions, with your data:

#standardSQL
WITH test AS (
  SELECT '{"a":"1", "b":"2"}' AS json, 3 AS t UNION ALL
  SELECT '{"b":"4"}' AS json, 2 AS t UNION ALL 
  SELECT '{"a":"4", "c":"5"}' AS json, 1 AS t 
)
SELECT data_row, TO_JSON_STRING(data_row) AS json_row
FROM (
  SELECT 
    ARRAY_TO_STRING(ARRAY_AGG(a IGNORE NULLS ORDER BY t DESC LIMIT 1),'') AS a,
    ARRAY_TO_STRING(ARRAY_AGG(b IGNORE NULLS ORDER BY t DESC LIMIT 1),'') AS b,
    ARRAY_TO_STRING(ARRAY_AGG(c IGNORE NULLS ORDER BY t DESC LIMIT 1),'') AS c
  FROM(
    SELECT JSON_EXTRACT_SCALAR(json,'$.a') AS a,
    JSON_EXTRACT_SCALAR(json,'$.b') AS b,
    JSON_EXTRACT_SCALAR(json,'$.c') AS c, 
    t
    FROM test
  )
) AS data_row

Notice that ARRAY_AGG is used only to find the latest no NULL value for each document, so it is transformed to STRING with ARRAY_TO_STRING. The result of this query is, which should be desired:

Row data_row.a  data_row.b  data_row.c  json_row     
1   1           2           5           {"a":"1","b":"2","c":"5"}   

The problem with this query is that you have to specify all the documents (in this case, a,b,c).

enle lin
  • 1,664
  • 8
  • 14
1

There might be a better way, while the first idea occurred to me is:

  1. Aggregate multiple json string using STRING_AGG with unique separator (like in below example: '||||||')
  2. Use JavaScript UDF to parse JSON / merge / output to string.
#standardSQL
CREATE TEMPORARY FUNCTION merge_json(json_string STRING)
RETURNS STRING
LANGUAGE js
AS 
"""
  // TODO 1: split json string with '||||||' to get multiple parts
  // .    2: parse each json parts into object
  //      3: merge objects in your own way

  // fake output, just to demonstrate the idea
  var obj = JSON.parse('{"a":"1", "b":"2", "c":"5"}')
  return JSON.stringify(obj);
""";
WITH
  sample_data AS (
  SELECT '{a:"1", b:"2"}' AS json, 1000 AS timestamp
  UNION ALL
  SELECT '{b:"4"}' AS json, 2000 AS timestamp
  UNION ALL 
  SELECT '{a:"4", c:"5"}' AS json, 1000 AS timestamp )

SELECT timestamp, merge_json(STRING_AGG(json, '||||||')) as joined_json
FROM sample_data
GROUP BY timestamp

Output:

output

Yun Zhang
  • 5,185
  • 2
  • 10
  • 29
  • very nice, more example of similar solution can be found in this question https://stackoverflow.com/q/52120182/1031958 – Tamir Klein Mar 22 '19 at 08:41
  • **Fake** output? **Fake** answers? Really - what is nice about it? Curious about this type of the answers! I saw few of such already – Mikhail Berlyant Mar 22 '19 at 11:34
  • I feel sorry about @MikhailBerlyant seems offended by a "fake" answer. While the consideration behind this answer is, it opens a door to handle much more complicated input with more complicated aggregation method. The original question mentioned it is to deal with "arbitrary documents" which can easily be nested data structure. REGEXP_REPLACE is going to have hard time to go further. – Yun Zhang Mar 22 '19 at 17:05