1

So some background is needed before this can be fully explained. I have a list of the distances between counties in the US, and a number of days. I need for each county, the sum of the distances between that county and every other county for every day. So for instance, if there are four counties, A, B, C and D, and 10 days, I want the distance AB+AC+AD (sum of the distances between A and every other county), BA+BC+BD, CA+CB+CD, for each of the 10 days. I know this sounds bizarre out of context, but it's hard to explain why without fully explaining my research.

So for a reprex. Consider 3 counties and 3 days.

dist <- runif(9)
source <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")

dest <- rep(c("A", "B", "C"), 3)
a
d1 <- as.data.frame(cbind(dest, source, dist))
d1$dist[b == c] <- 0
d1
county <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
lockdown <- c(0, 1, 0, 1, 1, 0, 1, 0, 1)
day <- c(1, 2, 3)
day <- rep(day, 3)

d2 <- as.data.frame(cbind(day, county, lockdown))

Which gives us:

 +------+--------+-------------------+--+--+
|      |        |                   |  |  |
+------+--------+-------------------+--+--+
| dest | source | dist              |  |  |
| A    | A      | 0                 |  |  |
| B    | A      | 0.815869745099917 |  |  |
| C    | A      | 0.554592022672296 |  |  |
| A    | B      | 0.193494795123115 |  |  |
| B    | B      | 0                 |  |  |
| C    | B      | 0.708462748210877 |  |  |
| A    | C      | 0.729813229991123 |  |  |
| B    | C      | 0.351678870152682 |  |  |
| C    | C      | 0                 |  |  |
+------+--------+-------------------+--+--+

This is a table of the distances between the three counties.

And then I have some panel data on the counties for each day:

+-----+--------+----------+--+--+
| day | county | lockdown |  |  |
+-----+--------+----------+--+--+
|   1 | A      |        0 |  |  |
|   2 | A      |        1 |  |  |
|   3 | A      |        0 |  |  |
|   1 | B      |        1 |  |  |
|   2 | B      |        1 |  |  |
|   3 | B      |        0 |  |  |
|   1 | C      |        1 |  |  |
|   2 | C      |        0 |  |  |
|   3 | C      |        1 |  |  |
+-----+--------+----------+--+--+

Now I want to loop over the counties and for each day, create a column in d2 that contains the sum of the distances between the county and each other county that isn't in lockdown. So think AB+AC+AD if B,C, and D aren't in lockdown on that particular day. I use a loop like this:

# these aren't numeric for some reason
d1$dist <- as.numeric(d1$dist)
d2$lockdown <- as.numeric(d2$lockdown)

# First loop over the source counties
for (i in unique(d1$source)) {
  # Next loop over the days
  for (j in unique(d2$day)) {
    # finally, for each destination county, fill in the entry in d2 with the sum of the distance times if the county is in lockdown

    for (k in unique(d1$dest)) {
      ifelse(i == k, d2$ma[d2$county == i & d2$day == j] <- d2$ma[d2$county == i & d2$day == j],
        d2$ma[d2$county == i & d2$day == j] <- d2$ma[d2$county == i & d2$day == j] + d1$dist[d1$source == i & d1$dest == k] * d2$lockdown[d2$day == j & d2$county == i]
      )
    }
  }
}

This works fine for such a small dataset, but there are 3000 counties, and 150 days so well over a billion iterations that need to be looped over. Is there a way to write this loop more efficiently? I tried using vectorized notation but since the output needs to be matched to d2, in the form of a new column, it's not really feasible.

Waldi
  • 39,242
  • 6
  • 30
  • 78

1 Answers1

2

This could be done using data.table which is way more efficient than loops:

  1. join d1 & d2 together
  2. group by source,day and sum distances for lockdown==1
set.seed(1)
dist <- runif(9)
source <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")

dest <- rep(c("A", "B", "C"), 3)
d1 <- data.frame(dest = dest,source = source,dist =  dist)
d1$dist[d1$source == d1$dest] <- 0
d1
county <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
lockdown <- c(0, 1, 0, 1, 1, 0, 1, 0, 1)
day <- c(1, 2, 3)
day <- rep(day, 3)

d2 <- data.frame(day = day, county = county, lockdown = lockdown)
library(data.table)
setDT(d1)
setDT(d2)
d <- d1[d2, on = .(dest = county),allow.cartesian=T]
d[,.(distance = sum(dist*lockdown)),by = .(source,day)]

   source day  distance
1:      A   1 0.9449773
2:      B   1 0.8983897
3:      C   1 0.6607978
4:      A   2 0.3721239
5:      B   2 0.9082078
6:      C   2 1.6054731
7:      A   3 0.5728534
8:      B   3 0.8983897
9:      C   3 0.0000000

Note that I simplified d1 & d2 creation scripts to remove the numeric problem you encountered due to cbind casting all columns to the type of the first column (character)

Waldi
  • 39,242
  • 6
  • 30
  • 78
  • Thanks! This is very helpful. Datatable certainly helps me save a lot of space. Unfortunately, in this situation, the d2 dataset is 3 million observations and d1 is about 400,000. With cartesian on, there's still memory issues. My laptop has 32 gigs of RAM so upgrading probably isn't the problem. The issue is with the merging step. I was hoping for some way to loop without merging. – Syed Mohammad Ali Abidi Sep 29 '20 at 17:05
  • Perhaps you only need neighbouring counties, which would avoid full cartesian product. See a similar [example](https://stackoverflow.com/a/62472640/13513328) – Waldi Sep 29 '20 at 17:54