0

Given percentages table which has a list of categories and their desired percentages, and raw data table (with item_id & category fields), how can I create a random sample of N items (e.g. N = 10) from the data table while ensuring the percentage of items for each category in the final sample is distributed according to percentages table. I.e. random sample with particular categories distribution.

Here we are basically trying to reproduce category distribution from a particular data table in a smaller sample.

with percentages as (
select 'A' as category, 0.3 as category_percentage UNION ALL
       'B' , 0.5 UNION ALL
       'C' , 0.2 
),

data as (
       Select 1 as item_id, 'A' as category UNION ALL
       Select  2, 'B' UNION ALL
       Select  3, 'B' UNION ALL
       Select  4, 'C' UNION ALL
       Select  5, 'A' UNION ALL
       Select  6, 'C' UNION ALL
       Select  7, 'B' UNION ALL
       Select  8, 'B' UNION ALL
       Select  9, 'B' UNION ALL
       Select  10, 'A' UNION ALL
       Select  11, 'A' UNION ALL
       Select  12, 'C' UNION ALL
       Select  13, 'C' UNION ALL
       Select  14, 'B' UNION ALL
       Select  15, 'A' UNION ALL
       Select  16, 'C' UNION ALL
       Select  17, 'A' UNION ALL
       Select  18, 'B'  
)

So in example above for N =10 it would create a sample with ten items where 30% (3 items) would belong to category A, 50% (5 items) to category B and 20% (2 items) to category C.

example output (random sample of 10 items with categories distributed based on percentages table):

  item_id, category
    5     'A'
    10    'A'
    15    'A'
    3     'B'
    8     'B'
    14    'B'
    10    'B'
    18    'B'
    6     'C'
    13     'C'

And related question, if each item in data also has a score field, how can you sample top scored items from data table (i.e. pick top ranked instead of random items) while keeping the category distribution from percentages table

This question is related but in our case the percentages for each category are coming from a separate table Stratified random sampling with BigQuery?

alex
  • 1,757
  • 4
  • 21
  • 32

1 Answers1

1

For a given sample size (e.g. 10) calculate from that the number of rows for each category using the percentages. Then, use row_number() over(partition by category order by newid()) (note you will need to use RAND() for bigquery) to randomly associate a number to each row within a category starting at 1. Then finally select the wanted number of rows per category calculated via the percentages.

WITH percentages AS (
    SELECT 'A' AS category, 0.3 AS category_percentage UNION ALL
    SELECT 'B', 0.5 UNION ALL
    SELECT 'C', 0.2
),
data AS (
    SELECT 1 AS item_id, 'A' AS category UNION ALL
    SELECT 2, 'B' UNION ALL SELECT 3, 'B' UNION ALL SELECT 4, 'C' UNION ALL SELECT 5, 'A' UNION ALL SELECT 6, 'C' UNION ALL SELECT 7, 'B' UNION ALL SELECT 8, 'B' UNION ALL SELECT 9, 'B' UNION ALL SELECT 10, 'A' UNION ALL SELECT 11, 'A' UNION ALL SELECT 12, 'C' UNION ALL SELECT 13, 'C' UNION ALL SELECT 14, 'B' UNION ALL SELECT 15, 'A' UNION ALL SELECT 16, 'C' UNION ALL SELECT 17, 'A' UNION ALL SELECT 18, 'B'
),
sample_size AS (
    -- Define the sample size here.
    -- In this example, the sample size is set to 10.
    SELECT 10 AS sample_size
),
category_counts AS (
    -- Calculate the number of rows to select from each category based on the sample size and the percentages defined in the `percentages` CTE.
    SELECT p.category,
           CEILING(s.sample_size * p.category_percentage) AS num_rows_to_select
    FROM percentages p CROSS JOIN sample_size s
),
seqnums AS (
    -- Assign a random sequence number to each row within each category using the `ROW_NUMBER()` function.
    SELECT d.category,
           d.item_id,
           ROW_NUMBER() OVER (PARTITION BY d.category ORDER BY NEWID()) AS seqnum,
           c.num_rows_to_select
    FROM data d JOIN category_counts c ON d.category = c.category
)
SELECT d.*
FROM data d JOIN seqnums s ON d.category = s.category AND d.item_id = s.item_id
WHERE s.seqnum <= s.num_rows_to_select
ORDER BY d.category
item_id category
17 A
15 A
11 A
3 B
18 B
2 B
8 B
14 B
16 C
4 C

fiddle

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51
  • It does seem counter-intuitive to ask for preset sample percentages, but there I assume is good reasons for it. Glad I could be of assistance. – Paul Maxwell Mar 25 '23 at 03:15
  • yes, the reason is i'm trying to reproduce category distribution from one sample in another, smaller one – alex Mar 25 '23 at 17:47
  • I should've made example more precise by calculating the percentages from raw data table above (of size 18) – alex Mar 25 '23 at 17:55
  • as a separate question how would i go about adding second strata, say in addition to 'category' field in data table, there is also 'label' field which distribution we want to also preserve, i.e. stratified sample from both category and label – alex Mar 25 '23 at 17:57
  • just found this suggestion, "create a new column that is a concatenation of the values in your other columns and stratify on the new column." https://stackoverflow.com/a/51525992/165130 – alex Mar 25 '23 at 18:04
  • add extra column to your percentages table, distribute the percentages accordingly, then adjust joins so both category and label are used. i.e. just extend the logic you already have. BUT if your total population is 18, and the wanted sample size is 10 not much randomness is going to be achieved – Paul Maxwell Mar 25 '23 at 18:46
  • 1
    thanks, the size of actual data table is in the millions so that should work. i found that concatenating category with label column, computing percentages and then using that as new category works fine – alex Mar 26 '23 at 17:46
  • i.e. without even changing the logic, so the new category is simply the combined "categoryA_label7", so the only change is in percentages table – alex Mar 26 '23 at 17:53