So some background is needed before this can be fully explained. I have a list of the distances between counties in the US, and a number of days. I need for each county, the sum of the distances between that county and every other county for every day. So for instance, if there are four counties, A, B, C and D, and 10 days, I want the distance AB+AC+AD (sum of the distances between A and every other county), BA+BC+BD, CA+CB+CD, for each of the 10 days. I know this sounds bizarre out of context, but it's hard to explain why without fully explaining my research.
So for a reprex. Consider 3 counties and 3 days.
dist <- runif(9)
source <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
dest <- rep(c("A", "B", "C"), 3)
a
d1 <- as.data.frame(cbind(dest, source, dist))
d1$dist[b == c] <- 0
d1
county <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
lockdown <- c(0, 1, 0, 1, 1, 0, 1, 0, 1)
day <- c(1, 2, 3)
day <- rep(day, 3)
d2 <- as.data.frame(cbind(day, county, lockdown))
Which gives us:
+------+--------+-------------------+--+--+
| | | | | |
+------+--------+-------------------+--+--+
| dest | source | dist | | |
| A | A | 0 | | |
| B | A | 0.815869745099917 | | |
| C | A | 0.554592022672296 | | |
| A | B | 0.193494795123115 | | |
| B | B | 0 | | |
| C | B | 0.708462748210877 | | |
| A | C | 0.729813229991123 | | |
| B | C | 0.351678870152682 | | |
| C | C | 0 | | |
+------+--------+-------------------+--+--+
This is a table of the distances between the three counties.
And then I have some panel data on the counties for each day:
+-----+--------+----------+--+--+
| day | county | lockdown | | |
+-----+--------+----------+--+--+
| 1 | A | 0 | | |
| 2 | A | 1 | | |
| 3 | A | 0 | | |
| 1 | B | 1 | | |
| 2 | B | 1 | | |
| 3 | B | 0 | | |
| 1 | C | 1 | | |
| 2 | C | 0 | | |
| 3 | C | 1 | | |
+-----+--------+----------+--+--+
Now I want to loop over the counties and for each day, create a column in d2 that contains the sum of the distances between the county and each other county that isn't in lockdown. So think AB+AC+AD if B,C, and D aren't in lockdown on that particular day. I use a loop like this:
# these aren't numeric for some reason
d1$dist <- as.numeric(d1$dist)
d2$lockdown <- as.numeric(d2$lockdown)
# First loop over the source counties
for (i in unique(d1$source)) {
# Next loop over the days
for (j in unique(d2$day)) {
# finally, for each destination county, fill in the entry in d2 with the sum of the distance times if the county is in lockdown
for (k in unique(d1$dest)) {
ifelse(i == k, d2$ma[d2$county == i & d2$day == j] <- d2$ma[d2$county == i & d2$day == j],
d2$ma[d2$county == i & d2$day == j] <- d2$ma[d2$county == i & d2$day == j] + d1$dist[d1$source == i & d1$dest == k] * d2$lockdown[d2$day == j & d2$county == i]
)
}
}
}
This works fine for such a small dataset, but there are 3000 counties, and 150 days so well over a billion iterations that need to be looped over. Is there a way to write this loop more efficiently? I tried using vectorized notation but since the output needs to be matched to d2, in the form of a new column, it's not really feasible.