I have a table in Snowflake. One of the columns in the table is called obj_key (object key).
The table is very large (in the TBs), so performance matters.
Now, a new row is added to the table every time an object is updated. The newly inserted row has the same obj_key, but a different value in the time_modified column. Suppose I want to fetch the distinct obj_key values from the table, subject to certain conditions.
I have three approaches in mind:
Approach 1:
SELECT obj_key
FROM my_table
WHERE some_condition
GROUP BY obj_key;
Approach 2:
SELECT DISTINCT obj_key
FROM my_table
WHERE some_condition;
Approach 3:
SELECT obj_key
FROM my_table
WHERE some_condition
QUALIFY ROW_NUMBER() OVER (PARTITION BY obj_key ORDER BY obj_key) = 1;
So essentially, my questions boil down to these:
I have read that distinct on multiple columns is executed as group_by(col1, col2, ..., col n). So how does the performance of the two differ (if it does)?
Since PARTITION BY also requires an ORDER BY, does it not drastically reduce the performance?
I'd love it if someone could explain the nitty-gritty of how these queries run on Snowflake.
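For reference, one way to compare the three approaches without scanning the table is to look at the plan Snowflake compiles for each. This is only a sketch: it assumes Snowflake's EXPLAIN command is available to the role in use, and my_table and some_condition are the same placeholders as in the queries above, so the condition must be replaced with a real predicate before running.

```sql
-- Sketch: show the compiled plan for each approach without executing it.
-- my_table and some_condition are placeholders from the question.

-- Approach 1: GROUP BY
EXPLAIN USING TEXT
SELECT obj_key
FROM my_table
WHERE some_condition
GROUP BY obj_key;

-- Approach 2: DISTINCT
EXPLAIN USING TEXT
SELECT DISTINCT obj_key
FROM my_table
WHERE some_condition;

-- Approach 3: QUALIFY with ROW_NUMBER()
EXPLAIN USING TEXT
SELECT obj_key
FROM my_table
WHERE some_condition
QUALIFY ROW_NUMBER() OVER (PARTITION BY obj_key ORDER BY obj_key) = 1;
```

If the first two plans show the same operators, that would suggest the optimizer treats them identically. A window-function step in the third plan would show where any extra work comes from. The plan only shows the shape of the query, so actual timings would still have to be checked against the query profile after a real run.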