
I am working on a problem where I determine the nature of the data points within a given distance of each data point. Basically, for each row of data, I try to determine the "neighborhood" of data points within a geographic range, and then figure out the characteristics of this "neighborhood."

The problem is that this is an O(n²) problem: I currently have nested for loops, meaning I am running nrow² calculations (I have 70k rows, so that's ~4.9B calculations .... OUCH)

So the R (pseudo) code I have is

for (i in 1:n.geopoints) {
   g1 <- df[i, ]
   for (j in 1:n.geopoints) {
      g2 <- df[j, ]
      if (gdist(lat.1 = g1$lat, lon.1 = g1$lon, lat.2 = g2$lat, lon.2 = g2$lon, units = "m") <= 1000) {
         [[[DO SOME STUFF]]]
      }
   }
}

How can this be accomplished in a more straightforward way? Is there a function I could lean on? Can I vectorize?

I have this in R, but can easily plop this over to Python if there is a better function available.

Thanks

dozyaustin
  • How fast would you like this to perform? Is your goal to find, for each point, all points within a small radius (neighborhood)? – Mateen Ulhaq Aug 19 '17 at 04:56
  • Yes exactly. I do not need it to be split-second fast, but I am on hour 20 of running this on my local rig ... would love to get it under 5 hours. I know the nested for loop is low-hanging fruit, but I have struggled to reason my way to removing it. – dozyaustin Aug 19 '17 at 05:10
  • 1
    Try looking into data structures other than lists. [k-d trees](http://pointclouds.org/documentation/tutorials/kdtree_search.php), for instance. The Point Cloud Library is sure to have something which fits your problem (and performs *very* fast). – Mateen Ulhaq Aug 19 '17 at 05:17

2 Answers


Here's one approach that uses data.table, and a re-written haversine formula I made for this question so that it works inside data.table operations.

The idea is to do a data.table join of every single point to every single point, but within the join calculate the distance between each pair of points and remove those that are outside the threshold. This is inspired by @Jaap's excellent answer here.

Setup

The haversine formula is

## Haversine formula: great-circle distance in metres between two lat/lon points
## (r defaults to the Earth's radius in metres)
dt.haversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
  radians <- pi/180
  lat_to <- lat_to * radians
  lat_from <- lat_from * radians
  lon_to <- lon_to * radians
  lon_from <- lon_from * radians
  dLat <- (lat_to - lat_from)
  dLon <- (lon_to - lon_from)
  a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
  return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)
}
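
As a quick sanity check, calling it on two nearby points gives the distance in metres; the coordinates here are the first two tram stops from the data below, which work out to roughly 48 m apart

dt.haversine(lat_from = -37.8090, lon_from = 144.9731, lat_to = -37.8094, lon_to = 144.9729)
## roughly 48 (metres)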

The data I'm using for this example comes from my googleway package; it's the tram stops on the City Loop tram in Melbourne.

library(googleway)

## Tram stops data
head(tram_stops)
#   stop_id                                     stop_name stop_lat stop_lon
# 1   17880           10-Albert St/Nicholson St (Fitzroy) -37.8090 144.9731
# 2   17892    10-Albert St/Nicholson St (East Melbourne) -37.8094 144.9729
# 3   17893 11-Victoria Pde/Nicholson St (East Melbourne) -37.8083 144.9731
# 4   18010    9-La Trobe St/Victoria St (Melbourne City) -37.8076 144.9709
# 5   18011  8-Exhibition St/La Trobe St (Melbourne City) -37.8081 144.9690
# 6   18030    6-Swanston St/La Trobe St (Melbourne City) -37.8095 144.9641

Calculations

Now that we have the data and the distance formula, we can construct the data.table join.

library(data.table)

## set the tram stop data as a data.table
dt1 <- as.data.table(tram_stops)

## add a column that will be used to do the join on
dt1[, joinKey := 1]

## find the distance from each point to every other point
## by joining the data to itself
dt2 <- dt1[
  dt1
  , {
    idx = dt.haversine(stop_lat, stop_lon, i.stop_lat, i.stop_lon) < 500 ## in metres
    .(stop_id = stop_id[idx],
      near_stop_id = i.stop_id)
  }
  , on = "joinKey"
  , by = .EACHI
]

Result

dt2 now holds two columns of stop_id values that are within 500 metres of each other (including each stop paired with itself, so those rows can be removed)

dt2 <- dt2[stop_id != near_stop_id]
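
From here, the "[[[DO SOME STUFF]]]" step from the question becomes a grouped aggregation on dt2. As a minimal sketch (the summary statistic is just an illustrative stand-in), counting how many other stops sit within 500 metres of each stop:

## neighbourhood size per stop (illustrative example)
dt2[, .(n_neighbours = .N), by = stop_id]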

Plot

As we're using googleway, let's plot some of the results (to do this you need a Google Maps API key, or use another mapping library such as leaflet)

mapKey <- "your_api_key"

## Just pick one to look at
myStop <- 18048
dt_stops <- dt2[stop_id == myStop]

## get the lat/lons of each stop_id
dt_stops <- dt_stops[
  dt1      ## dt1 contains the lat/lons of all the stops
  , on = c(near_stop_id = "stop_id")
  , nomatch = 0
]

google_map(key = mapKey) %>%
  add_circles(data = dt1[stop_id == myStop], lat = "stop_lat", lon = "stop_lon", radius = 500) %>%
  add_markers(dt_stops, lat = "stop_lat", lon = "stop_lon")

(Map output: a 500 m circle around the chosen stop, with markers on the stops inside it.)

Notes

The data.table join should be pretty efficient, but note it still evaluates all n² pairwise distances (just without the explicit loops), and the data I've used here is only 51 rows; you'll have to let me know how well this method scales to your data.
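
If it doesn't scale, a fixed-radius nearest-neighbour search avoids computing every pairwise distance, along the lines of the k-d tree suggestion in the comments. A minimal sketch, assuming the dbscan package and a flat-earth projection (a fair approximation at city scale):

library(dbscan)

## project lon/lat to approximate planar metres (equirectangular approximation)
r <- 6378137
lat0 <- mean(tram_stops$stop_lat) * pi / 180
xy <- cbind(
  x = tram_stops$stop_lon * pi / 180 * cos(lat0) * r,
  y = tram_stops$stop_lat * pi / 180 * r
)

## k-d tree backed fixed-radius search: all neighbours within 500 m of each point
nn <- frNN(xy, eps = 500)

## nn$id[[i]] holds the row indices of the points within 500 m of point i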

SymbolixAU

You may want to consider a different approach. I'd use a GIS tool like QGIS to segment your data. As you said, you don't need a full Cartesian join of the data, just local clusters. Have a look at some of the existing clustering questions on GIS Stack Exchange.

This question over on GIS Stack Exchange works through a similar type of problem with 800k data points: https://gis.stackexchange.com/questions/211106/clustering-points-polygons-based-on-proximity-within-specifed-distance-using-q
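
For reference, a rough R analogue of that buffer-and-intersect workflow, assuming the sf package and hypothetical lon/lat column names on the questioner's df:

library(sf)

## build point geometries from lon/lat (WGS84)
pts <- st_as_sf(df, coords = c("lon", "lat"), crs = 4326)

## project to a metre-based CRS appropriate for your region
## (28355, GDA94 / MGA zone 55, would suit Melbourne)
pts_m <- st_transform(pts, 28355)

## 1 km buffer around each point, then the points falling inside each buffer
buf  <- st_buffer(pts_m, dist = 1000)
hits <- st_intersects(buf, pts_m)  ## hits[[i]]: indices of points within 1 km of point i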

WombatPM
  • Thank you for this - it pointed me in the right direction. I used the fixed-distance buffer, then used a spatial query in PostGIS to summarize the characteristics of all the points that intersected each of the buffers. Ended up at just about 30 minutes of processing compared to the 48h+ of the Cartesian join. – dozyaustin Aug 23 '17 at 03:14