I'm trying to obtain a random sample of N rows from a table in Athena. Because the table I want to sample from is huge, the naive query
SELECT
id
FROM mytable
ORDER BY RANDOM()
LIMIT 100
takes forever to run, presumably because the ORDER BY requires all of the data to be sent to a single node, which then shuffles and sorts it.
I know about TABLESAMPLE, but that samples a percentage of the rows rather than a fixed number of them. Is there a better way of doing this?
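For reference, the percentage-based sampling I mean looks roughly like the sketch below. The 1 percent rate is an arbitrary placeholder, not something I have tuned. BERNOULLI decides row by row whether to include each row, so the number of rows returned varies from run to run and is not a fixed N.

```sql
-- Sketch only: samples roughly 1 percent of the rows (the rate is a placeholder).
-- The result size depends on the table size, so this does not return exactly N rows.
SELECT
  id
FROM mytable TABLESAMPLE BERNOULLI (1)
```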