I would like to build a control group for an academic social media study, made up of people with similar interests to my target group. I am using the Reddit dataset on BigQuery. I have full user histories (posts and comments) for my target group (chosen based on specific phrases), along with some other info, including which subreddit they posted in.
So I now have a column counting how often the target group posted in each subreddit, e.g.:
- 1000 AskWomen
- 300 vegan
- 80 confession
- etc.
Say the counts in this column total 2000. I would then like to select approximately 1000/2000 = 50% of the random authors from AskWomen, 300/2000 = 15% from vegan, etc. The total number of authors I would like to retrieve is around 10x the size of the target group (not yet determined). I could hardcode this of course, but I expect a very large number of subreddits, which makes that impractical.
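To make the arithmetic concrete, here is a minimal sketch in Python (not BigQuery SQL) of the proportional allocation I have in mind. The function name `allocate_quotas`, the `other` bucket of 620 (standing in for the "etc." so the counts sum to 2000), and the sample size of 10,000 are all made up for illustration. Rounding leftovers are handed to the subreddits with the largest remainders so the quotas always add up to the requested size.

```python
def allocate_quotas(target_counts, sample_size):
    """Split sample_size across subreddits in proportion to target_counts."""
    total = sum(target_counts.values())
    quotas, remainders = {}, {}
    for subreddit, count in target_counts.items():
        # Integer share plus the remainder that was rounded away.
        quotas[subreddit], remainders[subreddit] = divmod(sample_size * count, total)
    # Hand out the seats lost to rounding, largest remainder first.
    leftover = sample_size - sum(quotas.values())
    for subreddit in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[subreddit] += 1
    return quotas


# Hypothetical example: "other" stands in for the remaining subreddits.
target_counts = {"AskWomen": 1000, "vegan": 300, "confession": 80, "other": 620}
quotas = allocate_quotas(target_counts, 10_000)
print(quotas)
```

With many subreddits the dictionary would come from a query result rather than being typed in, which is exactly the part I do not want to hardcode.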
The closest thing I found was: Stratified random sampling with BigQuery? But as far as I understand, this only works with a column in the same table you are querying. Does anyone know how to do this?
Update: After lots of trying, I realized this is not quite what I need anyway. In case anyone stumbles upon this: you can basically apply the solution from the link, except that you need table stats for both groups (the 'target' group and the pool you sample from), so you can reweight the random-selection probabilities to fit the target distribution.
I lost the code, but that is the idea. It got me the right distribution, although not quite the right number of samples; I am unsure why.
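Since the original query is gone, here is a rough sketch of the reweighting idea in Python rather than BigQuery SQL. It is a reconstruction under assumptions, not the code I actually ran: all function names are hypothetical, and it simplifies by treating each pool author as belonging to exactly one subreddit. The selection probability for a subreddit is the number of authors wanted from it (its target share times the sample size) divided by the number of authors available in the pool. Note that keeping each row independently with that probability only hits the sample size on average, which is one possible (unconfirmed) explanation for the count being off; drawing a fixed number per subreddit avoids that.

```python
import random
from collections import defaultdict


def sampling_probabilities(target_counts, pool_counts, sample_size):
    """Per-subreddit selection probability that reproduces the target mix."""
    target_total = sum(target_counts.values())
    probs = {}
    for subreddit, count in target_counts.items():
        available = pool_counts.get(subreddit, 0)
        if available == 0:
            continue  # nothing to sample from in this subreddit
        wanted = sample_size * count / target_total
        probs[subreddit] = min(1.0, wanted / available)  # cap at "take everyone"
    return probs


def bernoulli_sample(pool, probs, rng):
    """Keep each (author, subreddit) row independently; size varies per run."""
    return [author for author, subreddit in pool
            if rng.random() < probs.get(subreddit, 0.0)]


def fixed_size_sample(pool, probs, rng):
    """Draw round(p * available) authors per subreddit; size is deterministic."""
    by_subreddit = defaultdict(list)
    for author, subreddit in pool:
        by_subreddit[subreddit].append(author)
    chosen = []
    for subreddit, authors in by_subreddit.items():
        k = round(probs.get(subreddit, 0.0) * len(authors))
        chosen.extend(rng.sample(authors, k))
    return chosen
```

Here `pool` is a list of `(author, subreddit)` pairs and `rng` is a `random.Random` instance, so a fixed seed makes the draw reproducible.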