0

I have a large table with rows that have a 'tag' column, structured as such.

id tag data
1 tag_2 ...
2 tag_1 ...
3 tag_4 ...
... ... ...
n tag_2 ...

The tags have different percentage representation. Is there a good way to get a random subset of rows that has at least one of each tag, and where the tags are represented proportionally to the tags themselves?

My current approach was to sort the rows by tag, and to take every nth row, but some tags were being accidentally skipped due to not having a very high representation of rows. My current query is:

SELECT t.id, t.tag
FROM
(
    SELECT id, tag, ROW_NUMBER() OVER (ORDER BY tag) AS rownum
    FROM datatable
) AS t
WHERE t.rownum % 100 = 0    -- can be any number

(I am currently using BigQuery, but can also use postgres for this query.)

Londala
  • 95
  • 5
  • Can only one tag appear in the tag column for each row? If so, then just `select id, tag from datatable order by random() limit 100000` should return a proportion very close to all data. If you have some extremely rare tags, then you may need to oversample with a UNION, depending what you're trying to accomplish. – Isolated Aug 16 '23 at 18:19
  • I just found out this great post: https://stackoverflow.com/questions/13040246/select-random-row-from-a-postgresql-table-with-weighted-row-probabilities maybe it can inspire you. – jian Aug 17 '23 at 01:15

0 Answers0