
I have a table that contains 1 billion rows, and I would like to attach the first row of each group (by `COL_C`) to every row in that group. But I'm getting a `Query exhausted resources at this scale factor.` error when I run the query.

MY_TABLE

+--------+---------+-------+
| COL_A  |  COL_B  | COL_C |
+--------+---------+-------+
| item_a | group_a |     1 |
| item_a | group_b |     2 |
| item_b | group_a |     1 |
+--------+---------+-------+

What I want to get

+--------+---------+-------+----------------+
| COL_A  |  COL_B  | COL_C | FIRST_ITEM_SEQ |
+--------+---------+-------+----------------+
| item_a | group_a |     1 |              1 |
| item_a | group_b |     2 |              1 |
| item_b | group_a |     1 |              1 |
+--------+---------+-------+----------------+

This is what I ran on Athena

SELECT ITEMS.*,
       FIRST_ITEMS.COL_C AS FIRST_ITEM_SEQ
FROM   MY_TABLE ITEMS
       LEFT JOIN (SELECT COL_A, COL_B, MIN(COL_C) AS COL_C
                  FROM   MY_TABLE
                  GROUP BY COL_A, COL_B) FIRST_ITEMS
            ON    ITEMS.COL_A = FIRST_ITEMS.COL_A
            AND   ITEMS.COL_B = FIRST_ITEMS.COL_B
;

Could someone suggest a better approach for this?

dkssud
  • Could you please share the DDL for the table and indexes involved? The results from EXPLAIN would be really helpful as well – Frank Heikens Apr 25 '23 at 12:09

1 Answer


I would argue that the first thing you want to avoid is joining large amounts of data. In this particular case you might want to try a window function instead:

select *,
   row_number() over (partition by COL_A, COL_B order by COL_C) as ITEM_SEQ 
from MY_TABLE
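To illustrate what the windowed numbering produces, here is a minimal sketch using SQLite via Python as a stand-in for Athena (both support standard `row_number() OVER (...)` syntax; the sample data matches the question's table):

```python
import sqlite3

# In-memory stand-in table with the question's sample rows
# (SQLite >= 3.25 supports window functions)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MY_TABLE (COL_A TEXT, COL_B TEXT, COL_C INTEGER)")
conn.executemany(
    "INSERT INTO MY_TABLE VALUES (?, ?, ?)",
    [("item_a", "group_a", 1), ("item_a", "group_b", 2), ("item_b", "group_a", 1)],
)

rows = conn.execute("""
    SELECT *,
           row_number() OVER (PARTITION BY COL_A, COL_B ORDER BY COL_C) AS ITEM_SEQ
    FROM MY_TABLE
""").fetchall()

# Each sample row is the only row in its (COL_A, COL_B) group,
# so every row gets ITEM_SEQ = 1
for row in rows:
    print(row)
```

Unlike the self-join, this numbers rows in a single pass over the table instead of building and joining a second billion-row intermediate result.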

If you then need only the first item per group (`FIRST_ITEM_SEQ = 1`), you can apply the filter via a subquery/CTE:

select * 
from (select *,
         row_number() over (partition by COL_A, COL_B order by COL_C) as FIRST_ITEM_SEQ
     from MY_TABLE)
where FIRST_ITEM_SEQ = 1 
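The filtered version can be sketched the same way (again SQLite via Python as a stand-in; here one group deliberately has two rows so the filter actually removes something):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MY_TABLE (COL_A TEXT, COL_B TEXT, COL_C INTEGER)")
# The (item_a, group_a) group has two rows, so only its lowest COL_C survives
conn.executemany(
    "INSERT INTO MY_TABLE VALUES (?, ?, ?)",
    [("item_a", "group_a", 5), ("item_a", "group_a", 1), ("item_b", "group_a", 3)],
)

first_rows = conn.execute("""
    SELECT * FROM (
        SELECT *,
               row_number() OVER (PARTITION BY COL_A, COL_B ORDER BY COL_C) AS FIRST_ITEM_SEQ
        FROM MY_TABLE
    ) t
    WHERE FIRST_ITEM_SEQ = 1
""").fetchall()

# One row per group remains: the one with the smallest COL_C
print(first_rows)
```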
Guru Stron
  • Hi, thanks for the comment. I tried using row_number() before, but it also caused a resource error, so I ended up trying the query in my question, which errored as well. – dkssud Apr 25 '23 at 12:39
  • @Jenny is your data partitioned? – Guru Stron Apr 25 '23 at 12:47
  • It's only partitioned by dt, but I can add the partition keys of the source table. The reason I haven't added them is that it would mean a lot of partition keys, and the cardinality of each partition is pretty big. – dkssud Apr 25 '23 at 12:51
  • @Jenny check out also [this answer](https://stackoverflow.com/a/60903104/2501279) – Guru Stron Apr 25 '23 at 12:55
  • Thanks, I read the answer, but sadly it wouldn't solve the issue. Do you know how to run sorting operations against such a large table? – dkssud Apr 25 '23 at 14:13