I'm wondering about the performance of ROW_NUMBER(): In my test case, I have many elements with large number (~100k) entries in each element. I would like to select the first entry for each element:
SELECT * FROM(
SELECT
element, msg, timestamp,
ROW_NUMBER() OVER(PARTITION BY element ORDER BY timestamp) as rank
FROM table
) t1
WHERE rank = 1
It seems excessive to me to need to order all 100k rows for each partition just to keep the first one. Something like a MIN() function should naively perform better. So I tried something like this:
SELECT
element, MIN(a)[0] as timestamp, MIN(a)[1] as msg
FROM(
SELECT element, ARRAY(timestamp, msg) as a
FROM table
) t1
GROUP BY partition
But when comparing the CPU usage for both methods, I'm getting identical performance. Is the ROW_NUMBER() optimizer much smarter than I thought, or am I missing something?
EDIT: Had a bug in my original query. Looking at the results now, it appears like the MIN(ARRAY) method is considerably faster than the ROW_NUMBER() filter. Is there an easier canonical way of implementing a FIRST_ROW OVER( PARTITION) filter without having to define a massive array and then deconstruct it?