random sample of size N in Athena

Question

I'm trying to obtain a random sample of N rows from Athena. But since the table from which I want to draw this sample is huge, the naive query

SELECT
id
FROM mytable
ORDER BY RANDOM()
LIMIT 100

takes forever to run, presumably because the ORDER BY requires all data to be sent to a single node, which then shuffles and orders the data.

I know about TABLESAMPLE, but that only lets me sample a percentage of the rows rather than a fixed number of them. Is there a better way of doing this?

What type of connector are you using? On a hive connector, I get slightly different rows each time I run a simple `SELECT * FROM t LIMIT 10`. It is biased towards newer data, I assume because a different node wins the "race" to return results each time. How unbiased does your sample need to be? — Dave Cameron, Jul 15 '17 at 23:34

score 55 · Accepted Answer · edited Sep 09 '20 at 21:34

Athena is actually built on Presto, so you can use TABLESAMPLE to get a random sample of your table.

Let's say you want a 10% sample of your table; your query would be something like:

SELECT id FROM mytable TABLESAMPLE BERNOULLI(10)

Note that there are two sampling methods, BERNOULLI and SYSTEM. See the Presto documentation on TABLESAMPLE for the difference between them.
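If you need a fixed number of rows rather than a percentage, one possible approach (a sketch, not something tested against Athena here) is to Bernoulli-sample a bit more than N rows and then apply ORDER BY RANDOM() LIMIT N to that much smaller subset. The helper below only builds the query string. `sample_query`, `total_rows`, and `safety` are hypothetical names for illustration. It assumes you already know the approximate row count (for example from a COUNT(*)) and that the engine accepts a fractional percentage in BERNOULLI.

```python
def sample_query(table: str, column: str, n: int, total_rows: int,
                 safety: float = 2.0) -> str:
    """Build a Presto-style query that returns about n random rows.

    Hypothetical helper: Bernoulli-samples roughly safety * n rows, then
    shuffles only that small subset and trims it to at most n rows.
    """
    if n <= 0 or total_rows <= 0:
        raise ValueError("n and total_rows must be positive")
    # Percentage of rows to keep, oversampled by `safety` so the Bernoulli
    # sample is likely (not guaranteed) to contain at least n rows.
    pct = min(100.0, 100.0 * safety * n / total_rows)
    # Identifiers are interpolated directly, so pass only trusted names.
    return (
        f"SELECT {column} FROM {table} TABLESAMPLE BERNOULLI({pct:.6g}) "
        f"ORDER BY RANDOM() LIMIT {n}"
    )


if __name__ == "__main__":
    print(sample_query("mytable", "id", 100, 1_000_000))
```

Because the Bernoulli sample size is itself random, the query can occasionally return fewer than N rows. Raising `safety` makes that less likely, at the cost of sorting a larger subset.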

edited Sep 09 '20 at 21:34 by user359996

answered Nov 06 '17 at 09:24 by Itay Kahana

Athena supports only `BERNOULLI` sampling. Console test shows that `TABLESAMPLE SYSTEM` is a no-op. – Dave Kielpinski Jun 29 '20 at 22:32
@DaveKielpinski Does no-op mean it doesn't require any extra processing? – wordsforthewise Feb 22 '22 at 16:34
