Selecting first row using row_number()

Question

I'm wondering about the performance of ROW_NUMBER(): In my test case, I have many elements with large number (~100k) entries in each element. I would like to select the first entry for each element:

SELECT * FROM(    
    SELECT 
      element, msg, timestamp, 
      ROW_NUMBER() OVER(PARTITION BY element ORDER BY timestamp) as rank
    FROM table
) t1
WHERE rank = 1

It seems excessive to me to need to order all 100k rows for each partition just to keep the first one. Something like a MIN() function should naively perform better. So I tried something like this:

SELECT
  element, MIN(a)[0] as timestamp, MIN(a)[1] as msg
FROM(
    SELECT element, ARRAY(timestamp, msg) as a
    FROM table
) t1
GROUP BY partition

~~But when comparing the CPU usage for both methods, I'm getting identical performance. Is the ROW_NUMBER() optimizer much smarter than I thought, or am I missing something?~~

EDIT: Had a bug in my original query. Looking at the results now, it appears like the MIN(ARRAY) method is considerably faster than the ROW_NUMBER() filter. Is there an easier canonical way of implementing a FIRST_ROW OVER( PARTITION) filter without having to define a massive array and then deconstruct it?

Perhaps use a cross apply with a seelect top 1 ordered by timestamp? — xQbert, Jun 19 '17 at 18:34
@xQbert Wouldn't the order by timestamp happen on all rows and only then the top 1 selection be applied? — moriartyj, Jun 19 '17 at 18:52
No not on a cross apply. it's a table valued function not a join which takes the top row when related by the element. so only the top 1 (oldest) timestamp of each element would be joined. https://stackoverflow.com/questions/9275132/real-life-example-when-to-use-outer-cross-apply-in-sql — xQbert, Jun 19 '17 at 19:05
sorry outer apply not cross: https://stackoverflow.com/questions/9275132/real-life-example-when-to-use-outer-cross-apply-in-sql — xQbert, Jun 19 '17 at 19:08

score -1 · Answer 1 · answered Jun 19 '17 at 16:20

You are missing something. The GROUP BY takes a lot of work -- seemingly similar to the ROW_NUMBER().

I am not intimately familiar with Hive's underlying mechanisms. In most databases, ROW_NUMBER() would take a physical identifier for the row, the various keys, and sort them separately. Using the physical identifier, the value can then be added back quickly to the original data. The detail, of course, might vary by database, but this is the general idea.

Hive probably does not do this. However, whatever it does would be similar to a GROUP BY in terms of effort.

Selecting first row using row_number()

1 Answers1