BigQuery compare all the columns(100+) from two rows in a sinle table

Question

I have input table as below-

id	col1	col2	time
01	abc	001	12:00
01	def	002	12:10

Required output table-

id	col1	col2	time	diff_field
01	abc	001	12:00	null
01	def	002	12:10	col1,col2

I need to compare both the rows and find all the columns for which there is difference in value and keep those column names in a new column diff_field.

I need a optimized solution for this as my table has more than 100 columns(all the columns need to be compared)

make instructions more clear? are you comparing row against row and record which columns had different values in the latter row? if so why does your output table diff_field column second row does not have "time" as values in time column are different in row1 vs row2? — Yev Guyduy, Feb 07 '23 at 14:03

score 1 · Answer 1 · answered Feb 08 '23 at 01:25

Below approach has no dependency on actual columns' names or any names convention rather then only id and time

create temp function extract_keys(input string) returns array<string> language js as """
  return Object.keys(JSON.parse(input));
  """;
create temp function extract_values(input string) returns array<string> language js as """
  return Object.values(JSON.parse(input));
  """;
select t.*, 
  ( select string_agg(col)
    from unnest(extract_keys(cur)) as col with offset
    join unnest(extract_values(cur)) as cur_val with offset using(offset)
    join unnest(extract_values(prev)) as prev_val with offset using(offset)
    where cur_val != prev_val and col != 'time'
  ) as diff_field
from (
  select t, to_json_string(t) cur, to_json_string(ifnull(lag(t) over(win), t)) prev
  from your_table t
  window win as (partition by id order by time)
)

if apply to sample data in your question (or rather extended version of it that I borrowed from Jaytiger answer) - the output is

It works great for me with over 200+ columns; thank you! In combination with comparing tables in the attached answer, I was able to perform unit testing to check the consistency of my query and find the differences https://stackoverflow.com/a/57912743/7224607 — arielhasidim, Jun 06 '23 at 06:26

score 0 · Answer 2 · answered Feb 07 '23 at 14:59

You might consider below approach:

WITH sample_table AS (
  SELECT '01' id, 'abc' col1, '001' col2, '12:00' time UNION ALL
  SELECT '01' id, 'def' col1, '002' col2, '12:10' time UNION ALL
  SELECT '01' id, 'def' col1, '002' col2, '12:20' time UNION ALL
  SELECT '01' id, 'ddf' col1, '002' col2, '12:30' time
)
SELECT * EXCEPT(curr, prev),
       (SELECT STRING_AGG('col' || offset) 
          FROM UNNEST(SPLIT(curr)) c WITH offset
          JOIN UNNEST(SPLIT(prev)) p WITH offset USING (offset)
         WHERE c <> p AND offset < ARRAY_LENGTH(SPLIT(curr)) - 1
       ) diff_field
  FROM (
    SELECT *, FORMAT('%t', t) AS curr, LAG(FORMAT('%t', t)) OVER w AS prev
      FROM sample_table t
    WINDOW w AS (PARTITION BY id ORDER BY time)
  );

Query results

BigQuery compare all the columns(100+) from two rows in a sinle table

2 Answers2