Given percentages
table which has a list of categories and their desired percentages, and raw data
table (with item_id
& category
fields), how can I create a random sample of N items (e.g. N = 10) from the data
table while ensuring the percentage of items for each category in the final sample is distributed according to percentages
table. I.e. random sample with particular categories distribution.
Here we are basically trying to reproduce category distribution from a particular data table in a smaller sample.
with percentages as (
select 'A' as category, 0.3 as category_percentage UNION ALL
'B' , 0.5 UNION ALL
'C' , 0.2
),
data as (
Select 1 as item_id, 'A' as category UNION ALL
Select 2, 'B' UNION ALL
Select 3, 'B' UNION ALL
Select 4, 'C' UNION ALL
Select 5, 'A' UNION ALL
Select 6, 'C' UNION ALL
Select 7, 'B' UNION ALL
Select 8, 'B' UNION ALL
Select 9, 'B' UNION ALL
Select 10, 'A' UNION ALL
Select 11, 'A' UNION ALL
Select 12, 'C' UNION ALL
Select 13, 'C' UNION ALL
Select 14, 'B' UNION ALL
Select 15, 'A' UNION ALL
Select 16, 'C' UNION ALL
Select 17, 'A' UNION ALL
Select 18, 'B'
)
So in example above for N =10 it would create a sample with ten items where 30% (3 items) would belong to category A, 50% (5 items) to category B and 20% (2 items) to category C.
example output (random sample of 10 items with categories distributed based on percentages
table):
item_id, category
5 'A'
10 'A'
15 'A'
3 'B'
8 'B'
14 'B'
10 'B'
18 'B'
6 'C'
13 'C'
And related question, if each item in data
also has a score
field, how can you sample top scored items from data
table (i.e. pick top ranked instead of random items) while keeping the category distribution from percentages
table
This question is related but in our case the percentages for each category are coming from a separate table Stratified random sampling with BigQuery?