-1

I'm trying to fill in a large Matrix with 217Mio elements. I tested my method with a small sample and it did what i wanted. However using the (almost) full sized data it takes for ever. My machine is calculating since almost 48 hours now. I'm looking for help to speed things up.

First dataframe contains origins (population) and the second destinations(supply location). Each origin and destination contains a ID indicating in which zone (census track) they are in. The matrix contains traveltimes for all possible combinations for travel from one zone to another.

first dataframe containing origins sptl_pop19_intrsct_zns

pop_index ID
P_000001 1000101
P_000002 1000101
... ...

second dataframe containing destinations sptl_sup_intrsct_zns

sup_index ID
S_0001 2000101
S_0002 2000101
... ...

these two dateframe contain more than just these two variables that aren't relevant for now.

Matrix containing traveltimes (mins) for all zone combinations NPVM_mtrx

1000101 ... 2000101 ...
1000101 10 ... 60 ...
... ... ... ... ...
2000101 60 ... 14 ...

my goal is to populate a matrix in this form demsup_mtrx

P_000001 P_000002 ...
S_0001 60 60 ...
S_0002 60 60 ...
... ... ... ...

I generated this matrix already but completely empty

For now I'm using two for loops to iterate through the empty matrix looking up the zone IDs for each origin-supply-pair.

for (r in rownames(demsup_mtrx)){
  # Get Zone ID from Supply Origin
  Sup_znID <- filter(sptl_sup_intrsct_zns, sup_index == r )%>%
    pull(ID)
  for (c in colnames(demsup_mtrx)){
    
    # Get Zone ID from Demand Origin
    Pop_znID <- filter(sptl_pop19_intrsct_zns, pop_index == c )%>%
      pull(ID)
    # Get travel duration for Pop-Sup pair  
    Pop_Sup_duration <- NPVM_mtrx[as.character(Pop_znID),as.character(Sup_znID)]
    # Fill in Matrix with Traveltime
    demsup_mtrx[r,c] = Pop_Sup_duration
  }}

For now I'm suspecting that the dplyr filter functions are the bottleneck but im not sure.

Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29

1 Answers1

1

(Note: I'm aware that this answer is not following the specs in the title of the question ("using dplyr filter functions"), but I think it will solve the overall question ...)

I think the bottleneck, rather than specifically dplyr::filter, is trying to pick out the elements one at a time ...

Based on your description I think this is right but I might have missed something.

example

I simplified your example to include only the focal elements explicitly shown in your question (but this should extend to arbitrarily large problems).

library(tidyverse)
## population
origins <- tibble(pop_index = c("P1", "P2"), id = rep(1000, 2))
dests <- tibble(sup_index = c("S1", "S2"), id = rep(2000, 2))
m <- matrix(c(10,60,60,14), 2,
            dimnames = list(c(1000, 2000), c(1000, 2000)))

solution (?)

The key steps are (1) use expand.grid() to get all combinations of origins and destinations; (2) index the distance matrix by using this two-column matrix.

## expand.grid() has element 1 (pop indices/ids) varying fastest
M <- as.matrix(expand.grid(origins$id, dests$id))
## next step is unnecessary if `id` elements are already character ...
storage.mode(M) <- "character"
matrix(m[M],
       byrow = TRUE,
       nrow = nrow(dests),
       dimnames= list(dests$sup_index, origins$pop_index))

The last step is a little tricky. We

  • extract the elements of m corresponding to the row/column indices (names) in M.
  • This gives us a vector; we need to reshape it into a matrix with the right dimensions (matrix(..., byrow = TRUE, nrow ... )). (By default R fills matrices in column-first order; byrow = TRUE modifies this.)

A tidyverse solution might also work (using expand(), left_join(), etc., but you will have to work in 'long format' up to the last step, then reshape the answers into a matrix. (This question will be useful for "melting" your initial distance matrix into a long tibble ...)

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • thank you a lot, I apologize If i didnt formulate the issue at hand very well. As you probably saw its my first time using stackoverflow. – SirMatealot Mar 06 '23 at 11:23
  • Does this do what you want? I didn't spend the time to try it on a larger example/benchmark it against your code. – Ben Bolker Mar 06 '23 at 14:37
  • just finished running the code and it works and does what it was supposed to. It didnt take 15min despite the huge amounts of elements. The only thing i had to change somehow was nrow = nrow(origings) (i replaced it with destinations) bc i got an error that the array lenght of dinnames didn't add up. I think that I didn't make it very clear that I dont have the same amount of destinations as origins. But I'm yet to understand what matrix(m[M])) excaclty does tho:) – SirMatealot Mar 06 '23 at 22:24