
I have two datasets: a smaller one (4,000 rows) with information about different stores, including each shop's lat/long coordinates, and a large one (600K rows) with information about people who live in the area, also with lat/long coordinates. I'm trying to find how many people live within a certain distance of each store, and I want to do this for multiple distances, i.e. for every store, find how many people live within 200 m, 500 m, 1 km, and 2 km of the store.

How can I go about doing this efficiently using R?

Brief pseudocode is below

for (store in stores) {
  for (distance in distances) {
    store[distance] <- find_people_within_distance(store, distance)
  }
}

find_people_within_distance <- function(store, distance) {
  # Return the number of people in the people dataset whose
  # geocoordinates fall within `distance` of the store
}

Thanks!

aport550
  • Have you already looked at answers like: https://stackoverflow.com/questions/31668163/geographic-geospatial-distance-between-2-lists-of-lat-lon-points-coordinates and https://stackoverflow.com/questions/21977720/r-finding-closest-neighboring-point-and-number-of-neighbors-within-a-given-rad ? – MrFlick Jul 06 '20 at 02:31
  • Yes unfortunately when I try these solutions my memory limit is exhausted. My laptop is only 8gb ram. – aport550 Jul 06 '20 at 04:38
  • Well that’s certainly going to be a problem for data that size. – MrFlick Jul 06 '20 at 04:43
  • take a look at `fuzzyjoin::geo_join()` – Wimpel Jul 06 '20 at 06:42
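A minimal sketch of the `fuzzyjoin::geo_join()` suggestion above, assuming both datasets are plain data frames with `lat`/`lon` columns and that the stores table has a `store_id` column (all hypothetical names):

```r
library(fuzzyjoin)
library(dplyr)

# Pair every person with every store within ~200 m (0.2 km)
within_200m <- geo_join(
  stores, people,
  by = c("lat", "lon"),
  max_dist = 0.2,        # threshold in km
  unit = "km",
  method = "haversine"
)

# Count matched people per store
counts <- count(within_200m, store_id)
```

Note that this builds the full pairwise join in memory, which with 600K people is likely the memory exhaustion reported in the comment above.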

1 Answer


I think that if you are not interested in the actual distance between people and stores, but only in counting how many people live within a certain threshold distance of each store, you can adopt the following approach.

# packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(magrittr)  # for the %>% pipe used below

The example is set in Milan, and the points are sampled at random from that area.

milan <- osmdata::getbb("Milan, Italy", format_out = "sf_polygon") %>% 
  magrittr::extract2("polygon") %>% 
  st_transform(crs = 3003) %>% 
  st_geometry()

Simulate data for the stores and the people:

stores <- st_sample(milan, size = 4000)
system.time({
  people <- st_sample(milan, size = 600000)
})
#>    user  system elapsed 
#>  127.50    1.80  142.74

Buffer the store points:

system.time({
  stores_buffers <- st_buffer(stores, dist = units::set_units(200, "m"))
})
#>    user  system elapsed 
#>    0.68    0.01    0.75

Check how many people lie within each buffer:

system.time({
  people_ids <- st_contains(stores_buffers, people)
})
#>    user  system elapsed 
#>   15.02    0.36   16.44
head(lengths(people_ids))
#> [1] 430 427 410 393 438 274
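Since the question asks for several thresholds, one way to extend this is to repeat the buffer-and-count step per distance. This is a sketch reusing the `stores` and `people` objects from above; because EPSG:3003 is in metres, plain numbers work as distances:

```r
distances <- c(200, 500, 1000, 2000)  # metres, the unit of EPSG:3003
counts <- sapply(distances, function(d) {
  buffers <- st_buffer(stores, dist = d)
  lengths(st_contains(buffers, people))
})
colnames(counts) <- paste0(distances, "m")
head(counts)  # one row per store, one column per threshold
```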

There may be more efficient approaches to this problem (for example, you could check the R package polyclip) but, at the moment, it takes less than 20 seconds to estimate how many people live in those buffers.
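As an alternative that skips building buffers entirely, `sf::st_is_within_distance()` should give the same counts directly (a sketch, not benchmarked here):

```r
# dist is interpreted in the units of the CRS (metres for EPSG:3003)
people_within_200m <- st_is_within_distance(stores, people, dist = 200)
head(lengths(people_within_200m))
```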

Created on 2020-07-06 by the reprex package (v0.3.0)

agila