3

I am trying to create a custom distribution function (based on the haversine). right now my prototype is a double for loop. I've looked around on vectorized operations, and I am still very much learning (very new to R), so it is not clear how to clean this up. In the end I want an NxN matrix that compares the distance between points on the globe. Here is my test data for right now:

coord
     Latitude   Longitude
1    16.34577    6.303545
2    12.49475   28.626396
3    27.79462   60.032495
4    44.42699  110.114216
5   -69.85409   87.946878

the evil double for-loop:

for (i in 1:dim(coord)[1]){
  for(j in 1:dim(coord)[1]) # for each column {
    mymat[i,j] = coord[i,1]*coord[j,2]     # custom function for future
  }
} 

Result:

           X1         X2         X3        X4        X5
1   103.03629   467.9204   981.2773  1799.902  1437.559
2    78.76122   357.6796   750.0910  1375.850  1098.874
3   175.20461   795.6596  1668.5801  3060.582  2444.450
4   280.04755  1271.7847  2667.0632  4892.043  3907.215
5  -440.32840 -1999.6708 -4193.5152 -7691.928 -6143.449

Of course, for 5 samples, no problem. But I have a list of 100k.

I did see a function after a search

custom.dist <- function(x, my.dist) {
  mat <- sapply(x, function(x.1) sapply(x, function(x.2) my.dist(x.1, x.2)))
  as.dist(mat)
}

But I don't understand what is going on and couldn't get it to work, even with a dummy function like x*y

RDS
  • 476
  • 1
  • 6
  • 12
  • Remember that you also have `Rcpp` at your disposal - http://stackoverflow.com/questions/29054459/how-to-speed-up-or-vectorize-a-for-loop?rq=1 – Shawn Mehan Dec 09 '15 at 01:53
  • 2
    this is just the outer product of the first and second columns? i.e. `cbind(coord[,1]) %*% rbind(coord[,2])`? – MichaelChirico Dec 09 '15 at 01:56
  • 2
    `outer(coord[,1], coord[,2],"*")` – Khashaa Dec 09 '15 at 01:58
  • 1
    @Khashaa just peeked at the source for `outer` for kicks, when using `"*"` it's faster to just skip to `coord[ , 1] %*% t(coord[ , 2])` – MichaelChirico Dec 09 '15 at 02:00
  • 2
    Or do you really want to define a different function in the middle of your loop? If so, you should really specify what exactly it is you mean by `my.dist` – MichaelChirico Dec 09 '15 at 02:01

1 Answers1

4

Looks like you just want the outer product. There is a function for this - conveniently named outer. Now outer can apply functions other than multiplication but the default is multiplication so we don't need to specify it explicitly.

> coord <- cbind(1:5, 2:6)
> coord
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    4
[4,]    4    5
[5,]    5    6
> outer(coord[,1], coord[,2])
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    3    4    5    6
[2,]    4    6    8   10   12
[3,]    6    9   12   15   18
[4,]    8   12   16   20   24
[5,]   10   15   20   25   30

Note that this approach generalizes easily to other binary functions as well

> outer(coord[,1], coord[,2], FUN = paste0)
     [,1] [,2] [,3] [,4] [,5]
[1,] "12" "13" "14" "15" "16"
[2,] "22" "23" "24" "25" "26"
[3,] "32" "33" "34" "35" "36"
[4,] "42" "43" "44" "45" "46"
[5,] "52" "53" "54" "55" "56"
Dason
  • 60,663
  • 9
  • 131
  • 148
  • @MichaelChirico Write it as an answer. I'm leaving this since it's a valid answer. I know matrix operations are faster here but sometimes it's not bad to go with a more readable version and to me this is slightly easier to follow. – Dason Dec 09 '15 at 02:03
  • `coord[ , 1] %*% t(coord[ , 2])` for those that don't want to scroll up. It's really not a separate answer because it's almost exactly what `outer` is doing under the hood; I was only suggesting you include it as an addendum to your perfectly acceptable answer. My approach will save probably close to no time (unless `as.vector` is very slow for some reason). – MichaelChirico Dec 09 '15 at 02:27
  • This got me on the right track, what i really wanted was the outer product of 2 2XN matrices (2 columns, N row) `a and b`. Although `outer` works great with single vectors, it was really throwing me for a loop. What I ended up with (after much struggling) was `sapply(1:3, function(z) sapply(1:3, function(y) paste(a[y,1],a[y,2],b[y,1],b[y,2])))` I can then replace the "paste" portion with the haversine formula (or any other that accepts 4 inputs). I think my issue is I think of vectors as columns by default, rather than rows. – RDS Dec 11 '15 at 05:02