1

This should be a simple extension of this question, but my result is not correct and I can't figure it out. I'd like the proportions in the table I'm drawing from to match the proportions of another table. I'd also like to have it stratified by two categories. I think it should be something like:

WITH table AS (
  SELECT *
  FROM `another_table` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total 
  FROM (
    SELECT cat1, cat2, COUNT(*) c 
    FROM table
    GROUP BY cat1, cat2
    HAVING c>1000000)
)

SELECT COUNT(*) samples, cat1, cat2, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, cat1, cat2, c  
  FROM table `fh-bigquery.reddit_comments.2018_09`
  JOIN table_stats b
  USING(cat1, cat2)
  WHERE RAND()< 1000/total
)
GROUP BY 2, 3

This should give about 1000 rows, but the result is much higher, and the percentage calculation is off.

Liam
  • 159
  • 11

1 Answers1

3

I think your rand() comparison is off:

WITH table AS (
      SELECT a.*
      FROM `another_table` a
     ),
    table_stats AS (
       SELECT cc.*, SUM(c) OVER () as total 
       FROM (SELECT cat1, cat2, COUNT(*) as c 
             FROM table
             GROUP BY cat1, cat2
             HAVING c > 1000000
            ) cc
      )
SELECT COUNT(*) as num_samples, cat1, cat2, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (SELECT id, cat1, cat2, c  
      FROM (select t.*, COUNT(*) OVER () as t_total,
                   COUNT(*) OVER (PARTITION BY cat1, cat2) as tcc_total
            from table `fh-bigquery.reddit_comments.2018_09` t
           ) t JOIN
           table_stats b
           USING (cat1, cat2)
      WHERE RAND() < (1000.0 / t.t_total) * (c / total) / (tcc_total / t_total)
     ) t
GROUP BY 2, 3;

Note that you need the total size of the second table to get the sample size (approximately) correct.

This is also random. If you really want a stratified sample, then you should do an nth sample on an order set. If that is of interest to you, then ask a new question, with appropriate sample data, desired results, and explanation.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786