2

I want to create a distance matrix between companies using their geographical locations.

I have a square distance matrix that contains the distances between 98 Italian provinces. I also have a a data frame with two columns. One column has the ID numbers for 8376 companies. The other column indicates in which of the 98 provinces that each one of these companies is located.

I want to create a 8376 by 8376 distance matrix that contains the distances between all the companies. The code I have written (below) is extremely inefficient. Is there anyway to do this more quickly? I'm asking because I need to this for multiple datasets.

Thus is what the data frame looks like

   cid province
1  61       TO
2 102       TO
3 123       AT
4 127       TO
5 158       TO
6 225       NO
7 232       TO
8 388       TO

This is what the square distance matrix looks like

     CH     AQ      PE       TE
1     0  64.39   41.74    81.18
2 64.39   0      40.38    61.05
3 41.74  40.38    0       40.79
4 81.18  61.05   40.79     0              


outcome = matrix(NA,8376,8376)  # empty matrix

for(i in 1:8376){
  for(j in (i+1):8376){
    x=which(dist.codes[,1]==companyID_Province[i,2]) # Find the row index in the distance matrix 
    y=which(dist.codes[1,]==companyID_Province[j,2]) # Find the column index in the distance matrix
    outcome[i,j] = dist.codes[x,y] # Specify the distance to the corresponding element in outcome matrix
  }
}
PaulWagner
  • 21
  • 3
  • Please provide a reproducible example. Show a small subset of `dist.codes` and `company_Province`. – nicola Oct 29 '15 at 11:08
  • In your code, `outcome[i,j] = dist.codes[x,y] ` suggests that `dist.codes` is the distance matrix of the provinces. But then `y=which(dist.codes[1,]==companyID_Province[j,2])` compares a distance with a company ID. – mra68 Oct 29 '15 at 14:02
  • dist.codes is the distance matrix of the provinces. x is the row index, y is the column index. Both are using the companyID_Province data frame. i have corrected that now, as before there was a mistake. In the code i am running it is correct, but stilll taking 24 hours plus, because there is millions of cells to fill – PaulWagner Oct 29 '15 at 14:50
  • Hi Nicola, I am currently running the code so i cant post subsets right now. I will do later today (if the run terminates). the data frame 'companyID_Province' is 8376* 2. The square distance matrix 'dist.codes' is 98*98 – PaulWagner Oct 29 '15 at 14:54
  • Your code uses only one column of `companyID_Province`, the second one. what is the first column good for? Does the second column contain the ID´s or the provinces? – mra68 Oct 29 '15 at 15:16
  • Does `dist.codes` only contain distances, or is the first row and the first column filled with province numbers? – mra68 Oct 29 '15 at 15:18
  • Hi mra68, the second column contains the province codes. You're right, I dont use the first column. The first column and the first row of the dist.codes matrix contain the province codes. – PaulWagner Oct 29 '15 at 15:34
  • **Why?** As in, what do you wish to do with the larger distance matrix? This seems an incredibly inefficient way to accomplish whatever goal that is. the 8376 x 8376 matrix will have much duplication since at least one province must have > 1 company. It seems far more appropriate to only look up the distances that you need when you need them, instead of looking them all up and then only using some of these outputs. – alexwhitworth Oct 29 '15 at 20:29
  • @Alex, The reslting matrix will be used in logistic spatial regression models – PaulWagner Oct 29 '15 at 21:49
  • hmm... still seems inefficient to me. But spatial statistics are definitively **not** my area of expertise, so I'll trust you. – alexwhitworth Oct 29 '15 at 22:09

1 Answers1

1

If dist.codes is the distance matrix of the provinces and province[i] is the province of the company with ID i, then dist.codes[province,province] is he distance matrix of the companies. If company is the data frame with company ID´s in company$ID and province numbers in company$province, then company$province[order(company$ID)] is the vector province above, ordered by company ID´s.

I've compared your code with my proposal:

SpeedComparison <- function(N,M)
{
  set.seed(1)

  dist.codes <- matrix(sample(1:1000,N*N,rep=TRUE),N,N) / 100
  dist.codes <- dist.codes * t(dist.codes)
  diag(dist.codes) <- 0
  dist.codes <- cbind(0:N,rbind(1:N,dist.codes)) # Add an additional row and an additional column with province numbers.

  companyID_Province <- data.frame( ID = 1:M, province = sample(1:N,M,replace=TRUE) )

  #---------------------------------------------------------------------

  tm.1 <- 0.01 * system.time(
    for ( i in 1:100)
    {
      outcome.1 = matrix(0,M,M)  # empty matrix

      for(i in 1:(M-1)){
        x=which(dist.codes[,1]==companyID_Province[i,2]) # Find the row index in the distance matrix 
        for(j in (i+1):M){
          y=which(dist.codes[1,]==companyID_Province[j,2]) # Find the column index in the distance matrix
          outcome.1[i,j] = dist.codes[x,y] # Specify the distance to the corresponding element in outcome matrix
        }
      }
    }
  )

  tm.2 <- 0.01 * system.time(
    for ( i in 1:100)
    {
      D <- dist.codes[-1,][,-1] # The additional row/column is not used here.
      outcome.2 <- D[companyID_Province[,2],companyID_Province[,2]]
    }
  )

  list( outcome = list( outcome.1+t(outcome.1), outcome.2 ),
        time    = list( tm.1, tm.2 ) )
}

#======================================================================

N <- 50

Comparison <- as.data.frame(matrix(NA,0,4))

for ( M in c(100,150,200,250,300) )
{
  Test <- SpeedComparison(N,M)

  Comparison <- rbind( Comparison,
                       c( M,
                          Test$time[[1]][3],
                          Test$time[[2]][3],
                          identical(Test$outcome[[1]],Test$outcome[[2]])))
}  

names(Comparison) <- c("M","time.1","time.2","outcomes.identical")

The outcomes are equal ( "1" means TRUE ), the times are anything but equal:

> Comparison
    M time.1 time.2 outcomes.identical
1 100 0.2568  2e-04                  1
2 150 0.5661  5e-04                  1
3 200 1.1845  7e-04                  1
4 250 1.9568  1e-03                  1
5 300 2.8602  4e-03                  1
> 
mra68
  • 2,960
  • 1
  • 10
  • 17
  • Hi, sorry but I'm confused by your answer. the code works, but it is super inefficient. I'm looking for an alternative way to achieve the same goal. any suggestions would be greatly appreciated – PaulWagner Oct 29 '15 at 14:59
  • @PaulWagner You need to vectorize your code. But you haven't given a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), which makes it hard for anyone to help you. mra68 has done a pretty good job here.... If you want a "better" answer, you need to provide a better question--IE one with data that is reproducible. – alexwhitworth Oct 29 '15 at 20:24
  • @PaulWagner: I've enhanced my answer by a comparison with your code, including a speed test. If you are still confused, I suggest that you do not simply apply my solution, but try to understand it first. This way you can learn something about R. – mra68 Oct 30 '15 at 09:07
  • "mra68 Thanks. I'll go through it this morning. I appreciate your time – PaulWagner Oct 30 '15 at 09:15