I have a large table with rows that have a 'tag' column, structured as such.
id | tag | data |
---|---|---|
1 | tag_2 | ... |
2 | tag_1 | ... |
3 | tag_4 | ... |
... | ... | ... |
n | tag_2 | ... |
The tags have different percentage representation. Is there a good way to get a random subset of rows that has at least one of each tag, and where the tags are represented proportionally to the tags themselves?
My current approach was to sort the rows by tag, and to take every nth row, but some tags were being accidentally skipped due to not having a very high representation of rows. My current query is:
SELECT t.id, t.tag
FROM
(
SELECT id, tag, ROW_NUMBER() OVER (ORDER BY tag) AS rownum
FROM datatable
) AS t
WHERE t.rownum % 100 = 0 -- can be any number
(I am currently using BigQuery, but can also use postgres for this query.)