Snowflake's performance on GROUP BY vs PARTITION BY vs DISTINCT

Question

I have a table in Snowflake. One of the columns in the table is called obj_key (object key). The table is very large (terabytes), so performance matters.

A new row is added to the table every time an object is updated. The newly inserted row has the same obj_key but a different value in the time_modified column. Suppose I want to fetch the distinct obj_key values from the table, subject to certain conditions.

I have three approaches in mind:

Approach 1:

SELECT obj_key
FROM my_table
WHERE some_condition
GROUP BY obj_key;

Approach 2:

SELECT DISTINCT obj_key
FROM my_table
WHERE some_condition;

Approach 3:

SELECT obj_key
FROM my_table
WHERE some_condition
QUALIFY ROW_NUMBER() OVER (PARTITION BY obj_key ORDER BY obj_key) = 1;

So essentially my question boils down to these:

I have read that DISTINCT on multiple columns is executed as a GROUP BY col1, col2, ..., coln. So how does the performance of the two differ (if it does)?

Since PARTITION BY also requires an ORDER BY, does it not drastically reduce performance?

I'd love it if someone could explain the nitty-gritty of how these queries run on Snowflake.

score 2 · Accepted Answer · answered May 26 '20 at 08:50

The first two queries can be executed with the same execution plan, depending on Snowflake's cardinality estimate.

Your third approach will use a window function operator, and it would probably take more time.

Since you have the dataset, I would highly recommend running your own tests and examining the execution plans and the performance:

https://docs.snowflake.com/en/user-guide/ui-query-profile.html#how-to-access-query-profile

In fact, I ran some tests against the SNOWFLAKE_SAMPLE_DATA database, and I can see that the first two queries are executed with the same execution plan and perform better than the third query.
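A comparison along these lines could be sketched as follows. This is only an illustration, not necessarily the exact test described above: it assumes the TPCH_SF1.ORDERS table from SNOWFLAKE_SAMPLE_DATA stands in for my_table, with O_CUSTKEY playing the role of obj_key and an arbitrary date filter standing in for some_condition.

```sql
-- Assumption: ORDERS stands in for my_table, O_CUSTKEY for obj_key,
-- and the date predicate for some_condition.
USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCH_SF1;

-- Disable the result cache so repeated runs are actually executed.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Approach 1: GROUP BY
EXPLAIN USING TEXT
SELECT o_custkey
FROM orders
WHERE o_orderdate >= '1995-01-01'
GROUP BY o_custkey;

-- Approach 2: DISTINCT
EXPLAIN USING TEXT
SELECT DISTINCT o_custkey
FROM orders
WHERE o_orderdate >= '1995-01-01';

-- Approach 3: window function with QUALIFY
EXPLAIN USING TEXT
SELECT o_custkey
FROM orders
WHERE o_orderdate >= '1995-01-01'
QUALIFY ROW_NUMBER() OVER (PARTITION BY o_custkey ORDER BY o_custkey) = 1;

-- After running the three queries without EXPLAIN, compare their
-- elapsed time (milliseconds) and bytes scanned for this session.
SELECT query_text, total_elapsed_time, bytes_scanned
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_SESSION())
ORDER BY start_time DESC
LIMIT 10;
```

EXPLAIN only shows the plan the optimizer intends to use, so it is worth also running each query and opening its Query Profile to see where the time is actually spent. Results on a small sample table may not carry over to a multi-terabyte table.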

answered May 26 '20 at 08:50

Gokhan Atil

Yes, I had been playing around with the query profile and running different queries. However, I saw that the `group by` had a run time similar to that of the `partition over`, which is why I asked this question. I also tried creating a `row number() over partition` as a new column `RN`, and then fetched only the rows with `RN=1`. This too, to my great surprise, had similar performance. – ghost May 26 '20 at 09:40
@ghost Can you please tell me how big your test data set was? – Jai Prakash Aug 30 '23 at 03:49
