This should be a simple extension of this question, but my result is not correct and I can't figure it out. I'd like the proportions in the table I'm drawing from to match the proportions of another table. I'd also like to have it stratified by two categories. I think it should be something like:
WITH table AS (
SELECT *
FROM `another_table` a
), table_stats AS (
SELECT *, SUM(c) OVER() total
FROM (
SELECT cat1, cat2, COUNT(*) c
FROM table
GROUP BY cat1, cat2
HAVING c>1000000)
)
SELECT COUNT(*) samples, cat1, cat2, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
SELECT id, cat1, cat2, c
FROM table `fh-bigquery.reddit_comments.2018_09`
JOIN table_stats b
USING(cat1, cat2)
WHERE RAND()< 1000/total
)
GROUP BY 2, 3
This should give about 1000 rows, but the result is much higher, and the percentage calculation is off.