I'm looking to do a weighted join between two datasets:

library(tidyverse)
set.seed(1)

test.sample <- data.frame(zip=sample(1:3,50,replace = TRUE))

index.dat <- data.frame(zip=c(1,1,2,3,3,3),
                        fips=c("A1", "A2", "B", "C1", "C2", "C3"),
                        prob=c(.75,.25,1,.7,.2,.1))

My expected output would be a weighted sample from the index dataset:

results1 <- c(rep("A1",14),rep("A2",4),rep("B",19),rep("C1",9),rep("C2",3),"C3")

Ultimately I'm trying to join zip codes that match multiple fips codes, choosing among the fips according to a probability distribution based on the population.

This comment is a good description of what I'm trying to overcome: https://stackoverflow.com/a/13316857/4828653

Here's a potential solution I've come up with, but given that I have billions of records, I need something much more performant.

test_function <- function(x) {
  index.dat %>%
    filter(zip == x) %>%
    sample_n(size = 1, weight = prob) %>%
    select(fips)
}

results2 <- lapply(test.sample$zip, test_function) %>%
  unlist() %>%
  data.frame(fips = .)

> table(results1)
results1
A1 A2  B C1 C2 C3 
14  4 19  9  3  1 
> table(results2)
results2
A1 A2  B C1 C2 C3 
15  3 19  8  2  3 
SCDCE
    Please use `set.seed()` when using functions such as `sample()` to ensure reproducibility – Sotos Oct 04 '21 at 13:15
  • My mistake on the Cs. The fips are assigned probabilities based on population distribution. These are just arbitrary values I set for a simple example. – SCDCE Oct 04 '21 at 13:42
  • The thing that I don't understand is that `A1` and `A2` combined have probability 1, which gets 14 + 5 = 19 rows. `B` has probability 1, which gets 20 rows, whereas C has combined probability 1 as well and gets only 11 rows. How do you get different numbers of rows with the same probability? What is the logic there? – Ronak Shah Oct 04 '21 at 13:49
  • I've adjusted the result numbers to seed = 1. – SCDCE Oct 04 '21 at 13:50

1 Answer

You can split index.dat according to the zip, to give a list of data frames, one for each individual zip code. If you use test.sample$zip to subset this list, you will get a list of 50 data frames with the appropriate zip codes. You can then sample the fips using the weights in the prob column of each data frame.

In your case, that would look like this:

sample_space <- split(index.dat, index.dat$zip)[test.sample$zip]

test.sample$fips <- sapply(sample_space, 
                           function(x) sample(x$fips, 1, prob = x$prob))

Now test.sample$fips will have a random fips chosen from the appropriate zip code, with the sampling done according to the relative weights. If we do a table of test.sample$fips, we can see that the proportions are about right:

table(test.sample$fips)

#> A1 A2  B C1 C2 
#> 13  5 19 10  3 

The 18 members of zip 1 have been assigned to A1 and A2 with an (almost) 75:25 split. All members of zip 2 are given a B, as expected, and the 13 members of zip 3 have been assigned appropriately (though by chance no C3s have been selected, due to its low probability).

If test.sample had 5000 rows, we would see that the proportions are much closer to the expected weightings due to the law of large numbers:

#>   A1   A2    B   C1   C2   C3 
#> 1257  419 1687 1153  325  159 
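
For reference, the larger experiment can be reproduced with a minimal sketch like the one below, assuming the same index.dat as above (the exact counts will vary with the seed used):

# Larger sample drawn the same way as before
set.seed(1)
big.sample <- data.frame(zip = sample(1:3, 5000, replace = TRUE))

sample_space <- split(index.dat, index.dat$zip)[big.sample$zip]
big.sample$fips <- sapply(sample_space,
                          function(x) sample(x$fips, 1, prob = x$prob))

table(big.sample$fips)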
Allan Cameron
  • How would this scale? My sample space is billions and there are >40k zip codes. I used an lapply above, which I think might parallelize more easily, but it's still not great. – SCDCE Oct 04 '21 at 14:48
  • @SCDCE it scales linearly - it should be O(n) complexity. The problem that will dog any solution here is memory usage. As long as you are able to do and store the equivalent of `split(index.dat, index.dat$zip)` to give your reference list of 40,000 zip codes and associated fips with weightings, then you can do your sampling a chunk at a time. – Allan Cameron Oct 04 '21 at 14:54
  • I'll give it a try, thanks for your time! – SCDCE Oct 04 '21 at 14:57
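
Building on the chunking suggestion in the comment above, here is a minimal sketch of what chunk-at-a-time sampling could look like. It assumes the split() lookup list fits in memory; chunk_size and the loop structure are illustrative assumptions, not part of the original answer.

# Build the reference list of zip codes once, then sample in chunks.
zip_lookup <- split(index.dat, index.dat$zip)

chunk_size <- 1e6  # arbitrary illustrative value; adjust to available memory
n <- nrow(test.sample)
test.sample$fips <- NA_character_

for (start in seq(1, n, by = chunk_size)) {
  idx <- start:min(start + chunk_size - 1, n)
  # Index by name so that real (non-consecutive) zip codes are matched
  # against the list names rather than by position.
  chunk_space <- zip_lookup[as.character(test.sample$zip[idx])]
  test.sample$fips[idx] <- sapply(chunk_space,
                                  function(x) sample(x$fips, 1, prob = x$prob))
}

Each chunk is independent of the others, so the loop body could also be handed to a parallel lapply if needed.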