BigQuery GROUP BY vs. PARTITION BY when appending the auxiliary columns

Question

To retrieve the supplementary information of primary keys, I came up with two ways of doing this.

One is to use GROUP BY with aggregation function [1] and the other is to use PARTITION BY and a window/analytic function (~ row_number) [2].

I am suspecting GROUP BY to be the lower performance way due to the aggregation function (might not be true since the PARTITION BY way uses subquery), however, I am not sure which would be the better way of appending supplementary columns.

Which would be the better way in terms of time complexity and performance?
Also, is there a better way of appending the auxiliary columns?

Example case: to retrieve the users' last login info (such as ip, country_code, ...)

Possible Solution 1: GROUP BY with an aggregation function

#standardSQL
SELECT
    user_id,  -- primary key
    ARRAY_AGG(
        STRUCT(
            ip, country_code, os, channel_type
        )
        ORDER BY login_timestamp DESC LIMIT 1
    )[OFFSET(0)].*
FROM user_login_info_table
GROUP BY user_id

Possible Solution 2: PARTITION BY with a subquery

#standardSQL
WITH
login_log AS (
    SELECT
        user_id,
        ip, country_code, os, channel_type,
        ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY login_timestamp DESC
        ) AS rn
    FROM user_login_info_table
)
SELECT user_id, ip, country_code, os, channel_type
FROM login_log
WHERE rn = 1

c.f. The above example is an oversimplified example, which ignored other aggregation fields.

References

I end up using `solution 1` when `GROUP BY` is needed with **other fields** that needed to be aggregated, and `solution 2` when no other aggregations are required. Thanks to the suggestion, I end up choosing queries with the empirical results by executing the samples. — Jiho Choi, Jun 29 '21 at 14:37

score 2 · Accepted Answer · answered Jun 25 '21 at 15:54

2

'Formal' Improvement for #2:

SELECT
    user_id,
    ip, country_code, os, channel_type
FROM user_login_info_table
WHERE true
QUALIFY 1 = ROW_NUMBER() OVER (
        PARTITION BY user_id ORDER BY login_timestamp DESC)

Which would be the better

In your oversimplified example - second solution makes more sense to me - but in more real use-case it depend
You always can run both version against your data and see execution plan/details to see how expensive each solution is

answered Jun 25 '21 at 15:54

Mikhail Berlyant

165,386
8
154
230

I don't think BigQuery supports the QUALIFY clause. [QUALIFY](https://stackoverflow.com/questions/44389057/qualify-equivalent-clause-in-bigquery-standard-sql) – Jiho Choi Jun 27 '21 at 09:05
1

It does! https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#qualify_clause – Mikhail Berlyant Jun 27 '21 at 14:38
I found out, thanks to the comment, that the `QUALIFY` clause was added on May 2021 as Pre-GA offerings. Thanks for the point-out and the link. [release notes](https://cloud.google.com/bigquery/docs/release-notes#May_10_2021) – Jiho Choi Jul 07 '21 at 08:22

score 0 · Answer 2 · answered Aug 02 '21 at 02:52

I tested multiple scenarios and end up with the below conclusion.

GROUP BY generally reduces the number of rows returned with aggregate functions in many cases. [1] On the other hand, PARTITION BY uses the WINDOW or ANALYTIC functions to return the same number of rows. Therefore, there is an advantage when it is possible to collect the results by specific keys and can reduce the output rows. When it is difficult to collect rows with GROUP BY keys, however, PARTITION BY is considered more advantageous as an overhead.

However, as mentioned in the above answer, it is better to test against the data.

EXAMPLE CASE 1:
using aggregation function and can reduce the results → GROUP BY

SELECT
    user_id,
    COUNT(DISTINCT logkey) as login_freq,  -- Agg. Function
    ARRAY_AGG(
        STRUCT(
            ip, country_code, os, channel_type
        )
        ORDER BY log_datetime DESC LIMIT 1
    )[OFFSET(0)].*,
FROM login_logs
GROUP BY user_id
;

EXAMPLE CASE 2:
filtering duplicate rows with logkey, when grouping with keys is relatively → PARTITION BY

SELECT user_id, ip, country_code, os, channel_type
FROM login_logs
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY logkey ORDER BY log_datetime) = 1
;

BigQuery GROUP BY vs. PARTITION BY when appending the auxiliary columns

Example case: to retrieve the users' last login info (such as ip, country_code, ...)

2 Answers2

Linked