
I have a large spatial dataset (12M rows). The geometries are points on a map. For each row in the dataset, I'd like to find all the points that are within 500 meters of that point.

In R, using sf, I've been trying to do this by looping over each row in parallel, running st_buffer and st_intersects, and then saving the result as a list in a key-value format (the key being the origin point, the values being its neighbors).

The issue is that the dataset is too large. Even when parallelizing to upwards of 60 cores, the operation takes too long (>1 week and usually crashes).

What are the alternatives to this brute-force approach? Is it possible to build indexes using sf? Perhaps push the operation to an external database?

Reprex:

library(sf)
library(tidyverse)
library(parallel)
library(foreach)


# example data, transformed to a projected CRS (UTM 18N, so distances are in meters):
nc <- st_read(system.file("shape/nc.shp", package="sf")) %>% st_transform(32618)
# expand the data a bit to make the example more interesting:
nc <- rbind(nc,nc,nc)
nc <- nc %>% mutate(Id = row_number())


## can run in parallel if desired (requires the doSNOW backend):
# library(doSNOW)
# num_cores <- parallel::detectCores() - 2
# cl <- makeSOCKcluster(num_cores)
# registerDoSNOW(cl)

# or just run in sequence:
registerDoSEQ()

neighbors <- foreach(ii = 1:nrow(nc)
                      , .verbose = FALSE
                      , .errorhandling = "pass") %dopar% {

                        l = 500 # 500 meters

                        # isolate the row as the origin point:
                        row_interest <- filter(nc, row_number()==ii)

                        # create the buffer:
                        buffer <- row_interest %>% st_buffer(dist = l)

                        # extract the row numbers of the neighbors
                        comps_idx <- suppressMessages(st_intersects(buffer, nc))[[1]]

                        # get all the neighbors:
                        comps <- nc %>% filter(row_number() %in% comps_idx)

                        # remove the geometry:
                        comps <- comps %>% st_set_geometry(NULL)

                        # flow control in case there are no neighbors:
                        if(nrow(comps) > 0) {
                          comps$Origin_Key <- row_interest$Id
                        } else {
                          # the example data has no lat/lon/bbl columns, so just
                          # return an empty match keyed to the origin Id:
                          comps <- tibble(Id = NA_integer_)
                          comps$Origin_Key <- row_interest$Id
                        }


                        return(comps)
                      }

closeAllConnections()

length(neighbors)==nrow(nc)
[1] TRUE
Tim_K
  • Could you give a minimal example so we can try something? See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – denis Feb 06 '18 at 19:14
  • Apologies, I'd thought the example code I provided should be sufficient? What about the example I've posted isn't up to the standard of being a reproducible example? – Tim_K Feb 06 '18 at 22:54
  • @Tim_K In the end I got curious and I implemented an integrated sf + data.table possible solution. You may be interested in the updated answer below. – lbusett Feb 20 '18 at 15:47
  • You should consider taking a look at this post: https://gis.stackexchange.com/questions/255671/approximate-distance-between-two-points-longitude-latitude-without-haversine ; I had the same problem and solved it with an approximation and `data.table` subsetting, which can be easily run in parallel also. I am not sure whether it is the fastest way to do it, but for 9*10^6 it takes about 80 hours on a single core, 40 hours on 2 cores and so on. – nilsole Feb 22 '18 at 09:57
  • @nilsole that post is helpful for thinking through the problem. The solution proposed there is to pre-filter with a square subset before doing the point-in-polygon calculation. Similar to @lbusett's answer below, but the subsetting is done on each individual point rather than carving the entire plane into an n x n grid – Tim_K Feb 24 '18 at 18:42

3 Answers


When working with sf objects, explicitly looping over features to perform binary operations such as intersects is usually counterproductive (see also: How can I speed up spatial operations in `dplyr::mutate()`?).

An approach similar to yours (i.e., buffering and intersecting), but without the explicit for loop, works better.

Let's see how it performs on a reasonably big dataset of 50000 points:

library(sf)
library(spdep)

pts <- data.frame(x = runif(50000, 0, 100000),
                  y = runif(50000, 0, 100000))
pts     <- sf::st_as_sf(pts, coords = c("x", "y"), remove = F)
pts_buf <- sf::st_buffer(pts, 5000)
coords  <- sf::st_coordinates(pts)

microbenchmark::microbenchmark(
  sf_int = {int <- sf::st_intersects(pts_buf, pts)},
  spdep  = {x   <- spdep::dnearneigh(coords, 0, 5000)}
  , times = 1)
#> Unit: seconds
#>    expr       min        lq      mean    median        uq       max neval
#>  sf_int  21.56186  21.56186  21.56186  21.56186  21.56186  21.56186     1
#>   spdep 108.89683 108.89683 108.89683 108.89683 108.89683 108.89683     1

You can see here that the st_intersects approach is 5 times faster than the dnearneigh one.
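
Note that the sgbp object returned by st_intersects is already, in essence, the key-value structure the question asks for: element i of int holds the row indices of the points falling within 5000 m of point i (point i itself included). A minimal sketch of flattening it into an origin/neighbour table, using plain row numbers as ids:

# flatten the intersects list into a long origin/neighbour table
neigh_df <- data.frame(origin    = rep(seq_along(int), lengths(int)),
                       neighbour = unlist(int))
head(neigh_df)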

Unfortunately, this is unlikely to solve your problem. Looking at execution times for datasets of different sizes we get:

subs <- c(1000, 3000, 5000, 10000, 15000, 30000, 50000)
times <- NULL
for (sub in subs[1:7]) {
  pts_sub <- pts[1:sub,]
  buf_sub <- pts_buf[1:sub,]
  t0 <- Sys.time()
  int <- sf::st_intersects(buf_sub, pts_sub)
  times <- cbind(times, as.numeric(difftime(Sys.time() , t0, units = "secs")))
}

plot(subs, times)

times <- as.numeric(times)
reg <- lm(times~subs+I(subs^2))
summary(reg)
#> 
#> Call:
#> lm(formula = times ~ subs + I(subs^2))
#> 
#> Residuals:
#>        1        2        3        4        5        6        7 
#> -0.16680 -0.02686  0.03808  0.21431  0.10824 -0.23193  0.06496 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  2.429e-01  1.371e-01   1.772    0.151    
#> subs        -2.388e-05  1.717e-05  -1.391    0.237    
#> I(subs^2)    8.986e-09  3.317e-10  27.087  1.1e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1908 on 4 degrees of freedom
#> Multiple R-squared:  0.9996, Adjusted R-squared:  0.9994 
#> F-statistic:  5110 on 2 and 4 DF,  p-value: 1.531e-07

Here, we see an almost perfect quadratic relationship between time and number of points (as would be expected). On a 10M points subset, assuming that the behaviour does not change, you would get:

predict(reg, newdata = data.frame(subs = 10E6))
#>        1 
#> 898355.4

That is roughly 900,000 seconds, i.e. about 10 days, assuming the trend holds when further increasing the number of points (but the same would happen for dnearneigh...).

My suggestion would be to "split" your points into chunks and then work on a per-chunk basis.

For example, you could order your points along the x-axis at the start and then quickly extract subsets of buffers and of points to compare them against, using data.table.

Clearly, the subset of "points" needs to extend beyond that of the "buffers" by the comparison distance. So, for example, if you take the subset of pts_buf with centroids in [50000, 55000] and the comparison distance is 500 m, the corresponding subset of pts should include points in the range [49500, 55500]. This approach is easily parallelizable by assigning the different subsets to different cores in a foreach or similar construct.

I do not even know if using spatial objects/operations is beneficial here: once we have the coordinates, all that is needed is computing and subsetting euclidean distances. I suspect that a carefully coded brute-force data.table-based approach could also be a feasible solution.
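
For instance, here is a minimal sketch of both ideas (the x-slicing and the plain distance filter) on synthetic data; the column names, chunk limits and maxdist value are placeholders rather than the exact code used in the update below:

library(data.table)

# synthetic points, analogous to the `pts` used in the benchmarks above
pts_dt  <- data.table(id = 1:50000,
                      x  = runif(50000, 0, 100000),
                      y  = runif(50000, 0, 100000))
maxdist <- 500

# points whose neighbours we want in this chunk: x in [50000, 55000)
origins    <- pts_dt[x >= 50000 & x < 55000]
# candidate neighbours: the same slice extended by maxdist on both sides
candidates <- pts_dt[x >= 50000 - maxdist & x < 55000 + maxdist]

# brute-force squared-distance filter for a single origin point
p     <- origins[1]
neigh <- candidates[(x - p$x)^2 + (y - p$y)^2 <= maxdist^2, id]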

HTH!

UPDATE

In the end, I decided to give it a go and see how much speed we could gain from this kind of approach. Here is a possible implementation:

points_in_distance_parallel <- function(in_pts,
                                        maxdist,
                                        ncuts = 10) {

  require(doParallel)
  require(foreach)
  require(data.table)
  require(sf)
  # convert points to data.table and create a unique identifier
  pts <-  data.table(in_pts)
  pts <- pts[, or_id := 1:dim(in_pts)[1]]

  # divide the extent into ncuts*ncuts quadrants and assign each point to a
  # quadrant, then key the table on "xcut" and "ycut"
  range_x  <- range(pts$x)
  limits_x <-(range_x[1] + (0:ncuts)*(range_x[2] - range_x[1])/ncuts)
  range_y  <- range(pts$y)
  limits_y <- range_y[1] + (0:ncuts)*(range_y[2] - range_y[1])/ncuts
  pts[, `:=`(xcut =  as.integer(cut(x, ncuts, labels = 1:ncuts)),
             ycut = as.integer(cut(y, ncuts, labels = 1:ncuts)))] %>%
    setkey(xcut, ycut)

  results <- list()

  cl <- parallel::makeCluster(parallel::detectCores() - 2, type =
                                ifelse(.Platform$OS.type != "windows", "FORK",
                                       "PSOCK"))
  doParallel::registerDoParallel(cl)
  # start cycling over quadrants
  out <- foreach(cutx = seq_len(ncuts), .packages = c("sf", "data.table")) %dopar% {

    count <- 0

    # get the points included in an x-slice extended by `maxdist`, and build
    # an index over y
    min_x_comp    <- ifelse(cutx == 1, limits_x[cutx], (limits_x[cutx] - maxdist))
    max_x_comp    <- ifelse(cutx == ncuts,
                            limits_x[cutx + 1],
                            (limits_x[cutx + 1] + maxdist))
    subpts_x <- pts[x >= min_x_comp & x < max_x_comp] %>%
      setkey(y)

    for (cuty in seq_len(ncuts)) {

      count <- count + 1

      # subset over subpts_x to find the final set of points needed for the
      # comparisons
      min_y_comp  <- ifelse(cuty == 1,
                            limits_y[cuty],
                            (limits_y[cuty] - maxdist))
      max_y_comp  <- ifelse(cuty == ncuts,
                            limits_y[cuty + 1],
                            (limits_y[cuty + 1] + maxdist))
      subpts_comp <- subpts_x[y >= min_y_comp & y < max_y_comp]

      # subset over subpts_comp to get the points included in an x/y chunk,
      # whose "neighbours" we want to find. Then buffer them.
      subpts_buf <- subpts_comp[ycut == cuty & xcut == cutx] %>%
        sf::st_as_sf() %>%
        st_buffer(maxdist)

      # retransform to sf since data.tables lose the geometric attributes
      subpts_comp <- sf::st_as_sf(subpts_comp)

      # compute the intersection and save results in an element of "results".
      # For each point, save its "or_id" and the "or_ids" of the points within "maxdist"

      inters <- sf::st_intersects(subpts_buf, subpts_comp)

      # save results
      results[[count]] <- data.table(
        id = subpts_buf$or_id,
        int_ids = lapply(inters, FUN = function(x) subpts_comp$or_id[x]))

    }
    return(data.table::rbindlist(results))
  }
  parallel::stopCluster(cl)
  data.table::rbindlist(out)
}

The function takes as input a points sf object, a target distance, and the number of "cuts" used to divide the extent into quadrants, and returns a data.table in which, for each original point, the "ids" of the points within maxdist are reported in the int_ids list column.
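
For example, a usage sketch on the pts object built in the benchmark above (50000 projected points), with a 5000 m search distance and a 10 x 10 grid of chunks:

neigh <- points_in_distance_parallel(pts, maxdist = 5000, ncuts = 10)
head(neigh)
# each row pairs the `id` (or_id) of an origin point with the list of
# `int_ids` of the points found within maxdist of it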

On a test dataset with a varying number of uniformly distributed points, and two values of maxdist, I got these kinds of results (the "parallel" run uses 6 cores):

[benchmark plot: execution times for varying numbers of points and two maxdist values, comparing the st_intersects approach with the "serial" and "parallel" chunked implementations]

So, here we get a 5-6X speed improvement already with the "serial" implementation, and another 5X thanks to parallelization over 6 cores. The timings shown here are merely indicative and tied to the particular test dataset we built (on a less uniformly distributed dataset I would expect a smaller improvement), but I think this is quite good.

HTH!

PS: a more thorough analysis can be found here:

https://lbusettspatialr.blogspot.it/2018/02/speeding-up-spatial-analyses-by.html

lbusett
  • For documentation purposes, I thought this comment from the SO question at the top of your answer seemed relevant: "avoid row-wise operations if the step involves a binary logical predicates (like st_intersects, st_crosses, etc.) because you lose the spatial indexing efficiency boost" – Tim_K Feb 09 '18 at 18:30

I have two alternatives: one that seems faster, and one that is not. Unfortunately, the faster method may not be amenable to parallelization, so it may not help.

library(sf)
nc <- st_transform(st_read(system.file("shape/nc.shp", package="sf")), 32618)
# create points
pts <- st_centroid(nc)

dis <- 50000
result <- list()

Your approach

system.time(
for (i in 1:nrow(pts)) {
    b <- st_buffer(pts[i,], dist = dis)
    result[[i]] <- st_intersects(b, nc)[[1]]
}
)

Slower alternative

system.time(
for (i in 1:nrow(pts)) {
    b <- as.vector(st_distance(pts[i,], pts))
    result[[i]] <- which(b <= dis)
}
)

For smaller datasets, without looping:

x <- st_distance(pts)
res <- apply(x, 1, function(i) which(i < dis)) 

Faster alternative (not obvious how to run in parallel), and perhaps an unfair comparison since we do not do the looping ourselves:

library(spdep)
pts2 <- st_coordinates(pts)
system.time(x <- dnearneigh(pts2, 0, dis))

I would first get a list with the indices of the neighbors, and extract attributes after that (that should be fast).
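
For instance, continuing from the objects defined above (a sketch; NAME is just one of the attribute columns of nc):

x <- dnearneigh(st_coordinates(pts), 0, dis)
# map each point's neighbour indices back to an attribute of nc
neigh_names <- lapply(x, function(idx) nc$NAME[idx])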

Robert Hijmans
  • Based off your answer, I was able to find this vignette, which further discusses the same topic: cran.r-project.org/web/packages/spdep/vignettes/nb_sf.html The same technique as above can be applied while staying within sf, e.g., x <- dnearneigh(st_coordinates(pts), 0, dis) – Tim_K Feb 07 '18 at 17:08

Working off of RobertH's answer, it is a bit faster to extract coordinates using sf::st_coordinates in this particular example.

library(sf)
library(spdep)
nc <- st_transform(st_read(system.file("shape/nc.shp", package="sf")), 32618)
# create points
pts <- st_centroid(nc)

dis <- 50000

# quickest solution:
x <- spdep::dnearneigh(sf::st_coordinates(pts), 0, dis)

microbenchmarking:

my_method <- function(pts) {
  result <- list()
  for (i in 1:nrow(pts)) {
    b <- st_buffer(pts[i,], dist = dis)
    result[[i]] <- st_intersects(b, nc)[[1]]
  }
  result
}

library(microbenchmark)

microbenchmark(
  my_method(pts),
  dnearneigh(as(pts, 'Spatial'), 0, dis),
  dnearneigh(st_coordinates(pts), 0, dis)
)

Unit: microseconds
                                    expr        min          lq        mean      median          uq        max neval
                          my_method(pts) 422807.146 427434.3450 435974.4320 429862.8705 434968.3975 596832.271   100
  dnearneigh(as(pts, "Spatial"), 0, dis)   3727.221   3939.8540   4155.3094   4112.8200   4221.9525   7592.739   100
 dnearneigh(st_coordinates(pts), 0, dis)    394.323    409.5275    447.1614    430.4285    484.0335    611.970   100

checking equivalence:

x <-  dnearneigh(as(pts, 'Spatial'), 0, dis)
y <- dnearneigh(st_coordinates(pts), 0, dis)

all.equal(x,y, check.attributes = F)
[1] TRUE
Tim_K
  • `as(pts, 'Spatial')` transforms an `sf` object to a `Spatial*` object as defined in sp. It is not part of `spdep`. `dnearneigh` accepts either a Spatial* object or a matrix of coordinates. Extracting the coordinates is faster, but both approaches are fast, and you only need to do this once for your whole data set, so the difference should not be that important (it should scale more or less linearly, whereas the distance computations do not). – Robert Hijmans Feb 07 '18 at 17:46
  • You're absolutely right. I tweaked the language in my answer to address that. My example above is very specific to this use case and doesn't necessarily apply in general. – Tim_K Feb 07 '18 at 21:11