Stratified random sample from column with standard SQL in BigQuery

Question

I would like to build a control group for an academic social media study, made up of people with interests similar to those of my target group. I am using the Reddit dataset on BigQuery. I have full user histories (posts and comments) for my target group (chosen based on specific phrases), along with some other information, including which subreddit each post was made in.

As a result, I now have a column of counts showing how often the target group posted in each subreddit, e.g.

1000 AskWomen
300 vegan
80 confession
etc.

Say these counts sum to 2000; then I'd like approximately 1000/2000 -> 50% of the randomly selected authors to come from AskWomen, 300/2000 -> 15% from vegan, etc. The actual number of authors I would like to retrieve would be around 10x the size of the target group (not yet determined). I could hardcode this of course, but I expect a very large number of subreddits, which makes that impractical.
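To make the arithmetic above concrete, here is a minimal Python sketch of the proportional allocation. The function name `allocate_quotas`, the `other` bucket, and the sample size of 500 are hypothetical and chosen only for illustration; the first three counts are the ones quoted above, padded so that the total is 2000.

```python
def allocate_quotas(post_counts, total_authors):
    """Split total_authors across subreddits in proportion to post_counts.

    post_counts: dict mapping subreddit name -> number of target-group posts.
    total_authors: desired size of the control sample (hypothetical value).
    """
    total_posts = sum(post_counts.values())
    # Each subreddit gets the same share of the sample as it has of the posts.
    return {
        subreddit: round(total_authors * count / total_posts)
        for subreddit, count in post_counts.items()
    }


# Illustrative counts; "other" is a made-up bucket so the total is 2000.
target_counts = {"AskWomen": 1000, "vegan": 300, "confession": 80, "other": 620}
quotas = allocate_quotas(target_counts, total_authors=500)
print(quotas)
```

Because each quota is rounded independently, the quotas may not sum exactly to `total_authors` for less convenient counts.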

The closest thing I found was the question "Stratified random sampling with BigQuery?", but as far as I understand, that only works when the stratifying column is in the same table you query. Does anyone know how to do this?

Update: After lots of trying, I realized this is not quite what I need anyway. In case anyone stumbles upon this: you can basically apply the solution from the link, except that you need table stats for both groups (the 'target' group and the table you sample from) so you can reweight the random-selection probabilities to fit the target distribution.

I lost the code, but that is the idea. It got me the right distribution, although not quite the right number of samples; I'm unsure why.
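Since the original query was lost, the following is only a sketch of the reweighting idea described in the update, written in Python rather than BigQuery SQL so the arithmetic can be checked. All names (`selection_probabilities`, `expected_sample_size`) and all counts are hypothetical. The assumption is that each candidate author in subreddit s is kept with probability p_s = min(1, N * target_share_s / pool_size_s), where N is the desired sample size. In BigQuery this would presumably become a join of the per-subreddit probabilities onto the pool table, followed by a filter such as `WHERE RAND() < p`.

```python
def selection_probabilities(target_counts, pool_counts, desired_total):
    """Per-subreddit keep-probability that reproduces the target distribution.

    target_counts: subreddit -> post count in the target group.
    pool_counts:   subreddit -> number of candidate authors to sample from.
    desired_total: desired size of the control sample.
    """
    target_total = sum(target_counts.values())
    probs = {}
    for subreddit, count in target_counts.items():
        pool = pool_counts.get(subreddit, 0)
        if pool == 0:
            probs[subreddit] = 0.0  # nothing to sample from
            continue
        quota = desired_total * count / target_total
        # A probability cannot exceed 1, so small pools are capped.
        probs[subreddit] = min(1.0, quota / pool)
    return probs


def expected_sample_size(probs, pool_counts):
    """Expected number of selected authors under per-row random selection."""
    return sum(pool_counts.get(s, 0) * p for s, p in probs.items())


# Made-up numbers for illustration only.
target = {"AskWomen": 1000, "vegan": 300, "other": 700}
pool = {"AskWomen": 50000, "vegan": 100, "other": 7000}
probs = selection_probabilities(target, pool, desired_total=1000)
expected = expected_sample_size(probs, pool)
```

This sketch also suggests two possible reasons for a sample count that is slightly off, though neither is confirmed for the original query. First, selecting each row independently with a fixed probability only hits the desired size in expectation, so the actual count varies from run to run. Second, any subreddit whose pool is smaller than its quota (here `vegan`) is capped at probability 1, which pulls the total below the desired size.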

Can you try to apply the solution from the other thread and post the query here? - Martin Weitzmann, May 22 '20 at 08:00
You could use RAND() over tables that each contain only one subreddit, since you would know the percentage for each. I believe that would be a bit tedious, but would it work for you? - Alexandre Moraes, May 22 '20 at 10:01
@AlexandreMoraes I hoped there would be a smarter way, but yeah, I do think it would work in principle, thanks! - Kwelleheli, May 25 '20 at 11:30
@MartinWeitzmann I will. Last time I couldn't get anything sensible, but I'll give it another go. - Kwelleheli, May 25 '20 at 11:31
In case you need help creating your query, you can update your question with some sample data so that I can help you. Is that OK? - Alexandre Moraes, May 25 '20 at 11:53

