Number of points within a radius with large datasets- R

Question

I have county parcel level shapefiles and I'm aiming to calculate the number of parcels within a mile (about 1610 meters) in total, as well as with the same owner. I've worked through a solution, and below is my sample code, but it is fairly inefficient and memory intensive. I cannot publicly post the data, but here is the problem with some made up code:

library(rgdal)
library(rgeos)
library(geosphere)


nobs<-1000  # number of observations
nowners<-50 # number of different owners
crs<-"+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
long<-runif(nobs,min=-94.70073, max=-94.24141) #roughly adair county in iowa
lat<-runif(nobs,min=41.15712,max=41.50415) #roughly adair county in iowa
coords<-cbind(long,lat)
owner<-sample(1:nowners,nobs, replace=T) # give id's to owners 
df<-as.data.frame(owner)
centroids<-SpatialPointsDataFrame(coords,df,proj4string = CRS(crs)) # make spatial dataframe 

d<-distm(centroids) # distance from centroids to other centroid

numdif<-matrix(0,length(owner)) #vectors of 0s to be replaced in loop
numtot<-matrix(0,length(owner))
for (i in 1:length(owner)) {
  same_id<-df$owner[i]==owner ## identify locations with same owner ID 
  numdif[i]<-as.numeric(sum(d[i,]<1609.34 & same_id==F)) #different parcel owners
  numtot[i]<-as.numeric(sum(d[i,]<1609.34)) #total parcels
}

The resulting "numdif" and "numtot" vectors give me what I want: A vector of number of neighboring parcels with different owners and total, respectively. However, this process is incredibly time consuming and memory intensive for my counties that have a far larger "nobs." Some counties have 50-75,000 observations (so the resulting matrix m has billions of elements and will likely require far more memory than I have). Does anyone have thoughts about a better way of approaching this problem, from both a speed and memory perspective? Help is greatly appreciated.

It would be very helpful if you generated some example data. See: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5965451#5965451 — Robert Hijmans, Dec 21 '17 at 03:05

Robert Hijmans · Accepted Answer · 2017-12-21T18:56:49.367

0

You could have done the counts in an apply

d <- d < 1609.34
nt <- apply(d, 1, sum)
nd <- apply(d, 1, function(i) length(unique(owner[i]))) - 1

I think that your computation of numdif is not correct, as it includes other owners multiple times if they have multiple parcels.

Given the large number of observations, I would consider this route:

d <- lapply(1:nrow(coords), function(i) which(distGeo(coords[i, ,drop=FALSE], coords) < 1609.34))
ntot <- sapply(d, length)
ndif <- sapply(d, function(i) length(unique(owner[i]))) - 1

That is slower, but it won't create a crazy large matrix

I should also add that your approach assumes that the parcels are small relatively to the distance considered, such that using the centroid is OK. If this is not the case, it is possible to do the computations on the polygons with rgeos::gWithinDistance, at increased computational cost.

edited Dec 21 '17 at 18:56

answered Dec 21 '17 at 18:46

Robert Hijmans

40,301
4
55
63

Robert, I'm actually ok with multiple owners multiple times, so that will not be a problem. I am traveling now but will try this and let you know how it goes in the coming days. Thank you – winitheju Dec 21 '17 at 19:46
Robert, This worked. It is much slower, however. 10,000 points took about 2 minutes, while 35,000 points took about 25! This might be unavoidable though. Thanks for your help – winitheju Jan 02 '18 at 17:11

Number of points within a radius with large datasets- R

1 Answers1