
I am trying to use `bigmemory` in R to compute distance matrices for more than 100,00,000 rows (rough estimate) and 16 columns.

A small subset of the data looks like this:

list1 <- data.frame(longitude = c(80.15998, 72.89125, 77.65032, 77.60599, 
                                  72.88120, 76.65460, 72.88232, 77.49186, 
                                  72.82228, 72.88871), 
                    latitude = c(12.90524, 19.08120, 12.97238, 12.90927, 
                                 19.08225, 12.81447, 19.08241, 13.00984,
                                 18.99347, 19.07990))
list2 <- data.frame(longitude = c(72.89537, 77.65094, 73.95325, 72.96746, 
                                  77.65058, 77.66715, 77.64214, 77.58415,
                                  77.76180, 76.65460), 
                    latitude = c(19.07726, 13.03902, 18.50330, 19.16764, 
                                 12.90871, 13.01693, 13.00954, 12.92079,
                                 13.02212, 12.81447), 
                    locality = c("A", "A", "B", "B", "C", "C", "C", "D", "D", "E"))


library(geosphere)

# create distance matrix
mat <- distm(list1[,c('longitude','latitude')], list2[,c('longitude','latitude')], fun=distHaversine)

# assign each point in list1 the locality of its nearest point in list2
list1$locality <- list2$locality[max.col(-mat)]

How can I use `bigmemory` to build massive distance matrices like this?

Hardik Gupta
  • If this project is feasible, I suspect it would go something like this: create an empty big matrix of type double that is the size you need (100M X 100M), or a 3-column matrix with the appropriate number of rows. You can see an example in the vignette that uses the required file backing to save it to disk. Then write a loop to fill it in. There may be some useful tool in the `biganalytics` package, or maybe in `bigmemory` itself, but you may just have to resort to a nested `for` loop that fills in the matrix. – lmo Jun 01 '17 at 18:00
  • 100M x 100M isn't realistic; 100K x 100K already takes 74.5 GB (see the quick size check after these comments). If you only need to access distances, you should compute them online. Nevertheless, I think the best way to compute a `big.matrix` of distances is to compute them block by block as standard R matrices (only the blocks in the lower triangle and diagonal). – F. Privé Jun 02 '17 at 05:31
  • The matrix is 100k+ x 16. I have many rows. – Hardik Gupta Jun 02 '17 at 05:34
  • If you have only 100K rows, you should edit your question with the right number. – F. Privé Jun 02 '17 at 05:37
  • @F.Privé, I have many rows (I actually don't even know the exact count), but the count is definitely greater than 100,00,000. Think of a very large matrix. – Hardik Gupta Jun 02 '17 at 05:39
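For reference, the 74.5 GB figure in the comments is just the storage cost of a dense double-precision matrix: n² entries at 8 bytes each. A quick check in R reproduces it:

# RAM/disk needed for a dense n x n matrix of doubles
n <- 1e5
n * n * 8 / 1024^3  # ~74.5 GiB

The same arithmetic at 10M rows gives roughly 800 TB, which is why the comments push toward computing distances online or block by block.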

1 Answer


Something like this works for me:

library(bigmemory)
library(foreach)
library(geosphere)  # for distm() and distHaversine()

# Split 1:m into nb contiguous blocks of roughly `block.size` indices.
# Returns one row per block: lower index, upper index, block size.
CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}

# Expand a (lower, upper) pair into the full index sequence.
seq2 <- function(lims) {
  seq(lims[1], lims[2])
}

n <- nrow(list1)
# file-backed n x n matrix of doubles, stored on disk rather than in RAM
a <- big.matrix(n, n, backingfile = "my_dist.bk",
                descriptorfile = "my_dist.desc")

intervals <- CutBySize(n, block.size = 1000)
K <- nrow(intervals)

doParallel::registerDoParallel(parallel::detectCores() / 2)  # use half the cores
foreach(j = 1:K) %dopar% {
  ind_j <- seq2(intervals[j, ])
  # compute only the blocks of the lower triangle (and diagonal), i >= j
  foreach(i = j:K) %do% {
    ind_i <- seq2(intervals[i, ])
    tmp <- distm(list1[ind_i, c('longitude', 'latitude')], 
                 list1[ind_j, c('longitude', 'latitude')], 
                 fun = distHaversine)
    a[ind_i, ind_j] <- tmp       # fill block (i, j)
    a[ind_j, ind_i] <- t(tmp)    # mirror it into the upper triangle
    NULL
  }
}
doParallel::stopImplicitCluster()
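Because the matrix is file-backed, it persists on disk after the R session ends; a later session can re-attach it from the descriptor file instead of recomputing. A minimal sketch, using the `my_dist.desc` file created above:

library(bigmemory)

# re-attach the on-disk matrix; nothing is copied into RAM up front
a <- attach.big.matrix("my_dist.desc")
a[1:5, 1:5]  # read back a small block of distances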

I repeated your `list1` 1000 times to test with 10K rows.
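That test data can presumably be built by simple replication, something like:

# stack 1000 copies of the 10-row example to get 10,000 rows
list1 <- do.call(rbind, replicate(1000, list1, simplify = FALSE))
nrow(list1)  # 10000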

F. Privé
  • When I have a file with 1314525 rows, it gives me this error: `Error in CreateFileBackedBigMatrix(as.character(backingfile), as.character(backingpath), : Problem creating filebacked matrix.` – Hardik Gupta Jun 02 '17 at 07:51
  • The issue is that your code works for 10K rows but not for 100K rows; it is not creating the matrix. – Hardik Gupta Jun 02 '17 at 07:57
  • Looks like I need to run it the traditional slow way :( – Hardik Gupta Jun 02 '17 at 08:10
  • @F.Privé @HardikGupta Did either of you ever figure out a solution to this problem? I am trying to create a distance matrix between two data frames (both are ~200K rows long, and each has just a latitude column and a longitude column). However, I get the same error as Hardik when I try your code: `Error in CreateFileBackedBigMatrix(as.character(backingfile), as.character(backingpath), : Problem creating filebacked matrix.` – Ana Apr 16 '18 at 17:17
  • There is some problem with the creation of the backing file. Please fix. – Jerin Mathew Feb 26 '21 at 07:16
  • @JerinMathew What do you mean? This runs perfectly fine after getting `list1` from OP's question. – F. Privé Feb 26 '21 at 07:43