
I am looking for a way to aggregate a BigQuery table with about 100 columns using the SUM() function. But the following query is not allowed in BigQuery Standard SQL:

select sum(*)
from `dataset.Intermediate_Tables.eventCat_dummies`
group by Id

The reason I want to do this kind of aggregation in BigQuery is its ability to handle large amounts of data. I tried to do the same aggregation in a Jupyter notebook, but it failed every time, probably because of the size of the data (a 7.3 GiB CSV file). The code I tried is the following:

df_type = type_dummies.groupby('Id', sort=False).sum()

Can anyone suggest how, or with what alternatives, I can get the aggregated data for this large dataset?

UPDATE WITH SAMPLE INPUT AND OUTPUT

Input data

Id col1 col2 col3 col4
1  0    0    0    1
2  0    1    1    1
1  1    0    0    0
4  0    0    0    0
19 0    0    0    0
2  1    1    1    1

Desired output

Id col1_sum col2_sum col3_sum col4_sum
1  1        0        0        1
2  1        2        2        2
4  0        0        0        0
19 0        0        0        0

In my original dataset, there are 100 columns and 40 million rows.
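
For reference, with just the four sample columns the aggregation could be written out explicitly as below; writing this out for all 100 columns is what I am trying to avoid:

#standardSQL
SELECT Id,
  SUM(col1) AS col1_sum,
  SUM(col2) AS col2_sum,
  SUM(col3) AS col3_sum,
  SUM(col4) AS col4_sum
FROM `dataset.Intermediate_Tables.eventCat_dummies`
GROUP BY Id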

cloudexplorer

1 Answer


Below is an example for BigQuery Standard SQL:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 a, 2 b, 3 c UNION ALL
  SELECT 1, 4, 5, 6 UNION ALL
  SELECT 2, 7, 8, 9
)
SELECT id, 
  SUM((
    SELECT SUM(CAST(SPLIT(pair, ':')[SAFE_OFFSET(1)] AS INT64)) 
    FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''))) pair 
    WHERE SPLIT(pair, ':')[SAFE_OFFSET(0)] != '"id"'
  )) val
FROM `project.dataset.table` t
GROUP BY id    

with the result as

Row id  val  
1   1   21   
2   2   24     
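
The trick that makes this work without naming any columns is TO_JSON_STRING(), which serializes an entire row into a JSON string. A minimal sketch of just that step, using the same mocked data:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 a, 2 b, 3 c
)
SELECT TO_JSON_STRING(t) AS json_row
FROM `project.dataset.table` t

This returns {"id":1,"a":1,"b":2,"c":3}. The main query strips the braces, splits on commas into "key":value pairs, and sums the CAST value of every pair whose key is not "id".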

As you can see, the above has no dependency on the number of columns.
It also assumes that all columns except the id column need to be summed; if you have more columns to exclude, you can adjust the WHERE clause accordingly, as in the sketch below.
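
For example, a sketch that excludes a second, hypothetical key column named other_id along with id (note that the keys keep their JSON double quotes):

WHERE SPLIT(pair, ':')[SAFE_OFFSET(0)] NOT IN ('"id"', '"other_id"')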

Update based on the provided details
So you want to sum each individual column separately (initially I read your question as wanting to sum all the columns' values together per id):

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 a, 2 b, 3 c UNION ALL
  SELECT 1, 4, 5, 6 UNION ALL
  SELECT 2, 7, 8, 9
)
SELECT id, 
  REPLACE(SPLIT(pair, ':')[SAFE_OFFSET(0)], '"', '')  col,
  SUM(CAST(SPLIT(pair, ':')[SAFE_OFFSET(1)] AS INT64)) val
FROM (
  SELECT id, 
    ARRAY(
      SELECT pair
      FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t1), r'[{}]', ''))) pair 
      WHERE SPLIT(pair, ':')[SAFE_OFFSET(0)] != '"id"'
    ) arr
  FROM `project.dataset.table` t1
) t2,
UNNEST(arr) pair
GROUP BY id, col
ORDER BY id, col

This will give you the result below:

Row id  col val  
1   1   a   5    
2   1   b   7    
3   1   c   9    
4   2   a   7    
5   2   b   8    
6   2   c   9    

This result is a flattened version of what you need, and in most practical use cases it is much more efficient than the pivoted one.

Still, if you want to pivot this result, see https://stackoverflow.com/a/35808045/5221944
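
For the sample columns col1..col4, a minimal sketch of that pivot via conditional aggregation, assuming the flattened query above is wrapped in a WITH clause named flattened (hypothetical name):

#standardSQL
-- `flattened` is assumed to hold the (id, col, val) result produced above
SELECT id,
  SUM(IF(col = 'col1', val, 0)) AS col1_sum,
  SUM(IF(col = 'col2', val, 0)) AS col2_sum,
  SUM(IF(col = 'col3', val, 0)) AS col3_sum,
  SUM(IF(col = 'col4', val, 0)) AS col4_sum
FROM flattened
GROUP BY id

As the linked answer shows, the pivoted shape requires listing every column name explicitly, which is why the flattened result scales better.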

Mikhail Berlyant
  • Hi Mikhail, I appreciate your help. I am trying to understand your solution, and several questions come to mind. First: aren't the column names implicitly involved when you use the **WITH** and **UNNEST** clauses in the solution? Second: how is the 21 calculated? If my questions are too naive, I hope you won't mind. By the way, I updated the question with a sample input and desired output. Could you let me know if this solution will still hold? Thanks again. – cloudexplorer Jun 11 '18 at 16:12
  • see update in answer. meantime - 1) WITH is used just to mimic your data - you can remove that part and use your own `project.dataset.table`, and it will work for as many columns as you have in your table; and UNNEST - as you can see from the script/query - does not use column names at all! 2) 21 is the sum of all columns for id=1 (1+2+3+4+5+6=21) - of course, this is how I understood your question initially, but after you provided more details it is clear that's not what you wanted, so I added an update to my initial answer. hope this helped :o) – Mikhail Berlyant Jun 11 '18 at 16:37
  • Thank you for sharing your precious knowledge, Mikhail. Your approach to the solution and explanation look excellent to me. I can learn a lot from it. Although I have not tried to apply your idea to my dataset, I like your solution and am sure it will lead me to the right place. I will upvote it and hope it could help more people who come across similar problems. – cloudexplorer Jun 11 '18 at 16:59