I'm looking to do a weighted join between two datasets:
library(tidyverse)
set.seed(1)
test.sample <- data.frame(zip=sample(1:3,50,replace = TRUE))
index.dat <- data.frame(zip  = c(1, 1, 2, 3, 3, 3),
                        fips = c("A1", "A2", "B", "C1", "C2", "C3"),
                        prob = c(.75, .25, 1, .7, .2, .1))
My expected output would be a weighted sample from the index dataset:
results1 <- c(rep("A1", 14), rep("A2", 4), rep("B", 19), rep("C1", 9), rep("C2", 3), "C3")
Ultimately, I'm trying to join zip codes that each match multiple fips codes, drawing the fips code for each record from a probability distribution based on the population.
This comment is a good description of what I'm trying to overcome: https://stackoverflow.com/a/13316857/4828653
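To make the expected output concrete: the expected number of draws for each fips code is the number of sample records with that zip multiplied by prob. A minimal base R sketch, rebuilding the data above so it runs on its own (zip.counts, n.zip and expected are names introduced here for illustration only):

```r
set.seed(1)
test.sample <- data.frame(zip = sample(1:3, 50, replace = TRUE))
index.dat <- data.frame(zip  = c(1, 1, 2, 3, 3, 3),
                        fips = c("A1", "A2", "B", "C1", "C2", "C3"),
                        prob = c(.75, .25, 1, .7, .2, .1))

# Count sample records per zip; fixing the levels keeps zips with zero records.
zip.counts <- table(factor(test.sample$zip, levels = unique(index.dat$zip)))

# Look up the record count that applies to each row of the index.
n.zip <- as.vector(zip.counts[as.character(index.dat$zip)])

# Expected draws per fips: records with that zip times the fips probability.
expected <- data.frame(fips = index.dat$fips, expected = n.zip * index.dat$prob)
expected
```

Rounded to whole records, these are the kind of counts that results1 is meant to represent; an actual weighted sample will scatter around them.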
Here's a potential solution I've come up with, but since I have billions of records, I need something much more performant.
# Draw one fips code for a single zip, weighted by prob
test_function <- function(x) {
  index.dat %>%
    filter(zip == x) %>%
    sample_n(size = 1, weight = prob) %>%
    select(fips)
}
# Apply the draw to every record, one at a time
results2 <- lapply(test.sample$zip, test_function) %>%
  unlist() %>%
  data.frame(fips = .)
> table(results1)
results1
A1 A2  B C1 C2 C3
14  4 19  9  3  1
> table(results2)
results2
A1 A2  B C1 C2 C3
15  3 19  8  2  3
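For comparison, here is a rough base R sketch of the same per-record draw done once per zip instead of once per record. It is not benchmarked and is only an assumption about what a faster version might look like; weighted_join is a hypothetical helper name, and records whose zip is missing from the index are left as NA.

```r
set.seed(1)
test.sample <- data.frame(zip = sample(1:3, 50, replace = TRUE))
index.dat <- data.frame(zip  = c(1, 1, 2, 3, 3, 3),
                        fips = c("A1", "A2", "B", "C1", "C2", "C3"),
                        prob = c(.75, .25, 1, .7, .2, .1))

# Hypothetical helper: one weighted draw per record, grouped by zip.
weighted_join <- function(zips, index) {
  out <- rep(NA_character_, length(zips))
  record.groups <- split(seq_along(zips), zips)  # record positions per zip
  index.groups  <- split(index, index$zip)       # candidate fips rows per zip
  for (z in names(record.groups)) {
    rows <- index.groups[[z]]
    if (is.null(rows)) next                      # zip not in index: keep NA
    hit <- record.groups[[z]]
    # Draw all records for this zip in a single call, weighted by prob.
    pick <- sample.int(nrow(rows), length(hit), replace = TRUE, prob = rows$prob)
    out[hit] <- as.character(rows$fips)[pick]
  }
  out
}

results3 <- weighted_join(test.sample$zip, index.dat)
table(results3)
```

The loop runs over distinct zips rather than records, so the number of sampling calls depends on how many zips there are, not on how many rows are being joined. Sampling row positions with sample.int also avoids the special case where sample() is handed a single value.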