
Edit: Real data set available here

With thanks to

Wang, Rui, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor, Xia Zhou, Dror Ben-Zeev, and Andrew T. Campbell. "StudentLife: Assessing Mental Health, Academic Performance and Behavioral Trends of College Students using Smartphones." In Proceedings of the ACM Conference on Ubiquitous Computing. 2014.


Explanation

I'm running a simulation study in which I am performing stop detection on location data (lat/lon coordinates) based on relatively simple criteria.

A location (A) is a stop if there exists another location (B) with a timestamp at least 180 seconds after A's, and if every location from A through B (inclusive) is within 80 meters of A.

I've tried to reduce the data such that it still works but doesn't require actual coordinates.

library(data.table)

data <- data.table(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                   latlon = c(0, 50, 80, 90, 90, 100, 190, 110, 110, 110),
                   time = c(0, 60, 120, 180, 240, 300, 360, 420, 480, 520))

id 1 isn't a stop, because the first location at least 180 seconds later (id 4) is at a latlon distance of 90, and every later candidate for B has id 4 between it and id 1.

id 2 is a stop, because all locations between itself and the first location at least 180 seconds later (id 5) are within a distance of 80 (0, 30, 40, 40).

id 6 is not a stop, because even though id 10 is more than 180 seconds later, id 7, which falls between them, is at a distance greater than 80.

id 8 is not a stop, because no location occurs at least 180 seconds after it.
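
These cases can be checked against a naive reference implementation of the rule (a sketch only; is_stop is a name introduced here, following the "at least 180 seconds" reading above):

# Naive O(n^2) reference: A is a stop if some later location B is at least
# 180 seconds on and every location from A through B stays within 80 of A
is_stop <- function(a, latlon, time) {
  candidates <- which(time - time[a] >= 180)
  any(vapply(candidates,
             function(b) all(abs(latlon[a:b] - latlon[a]) <= 80),
             logical(1)))
}

sapply(seq_len(nrow(data)), is_stop, latlon = data$latlon, time = data$time)
# FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE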

Ultimately, I need to be able to assign a "stop id" greedily, such that if I find, for example, that id 2 has points that satisfy the distance requirement through id 7, locations with ids 2:7 get a stop id of 2.

Matrix and for loop

If I run this:

nrows <- nrow(data)

# Pairwise absolute distances and time differences; keep only the lower
# triangle, i.e. comparisons of each point with itself and later points
latlon_dist <- abs(outer(data$latlon, data$latlon, `-`))
latlon_dist[upper.tri(latlon_dist)] <- NA
time_window <- outer(data$time, data$time, `-`)
time_window[upper.tri(time_window)] <- NA

# For each origin point (one column of the array), find the first later point
# that breaks the 80-meter limit; it's a stop if that break comes no earlier
# than the first point more than 180 seconds on. Relies on nrows from the
# enclosing environment.
foo <- function(x){
  mindist <- min(which(x[, 1] > 80), nrows)
  if (mindist >= min(which(x[, 2] > 180), nrows + 1)) mindist else NA
}

bar <- array(c(latlon_dist, time_window),
             dim = c(nrows, nrows, 2))

apply(bar, 2, foo)

It gives me back the break indices NA 7 7 NA NA NA NA NA NA NA, which I can use in a for loop to set the stop id as appropriate.

thresholds <- apply(bar, 2, foo) - 1  # last index still inside each stop

previous_threshold <- 0
for (i in seq_along(thresholds)) {
  current_threshold <- thresholds[i]
  
  # Greedy: only start a new stop if it extends past the previous one
  if (!is.na(current_threshold) && current_threshold > previous_threshold) {
    data[i:current_threshold, stop_id := i]
    previous_threshold <- current_threshold
  }
}
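
On the toy data, thresholds comes out as NA 6 6 NA NA NA NA NA NA NA, and the loop assigns:

data$stop_id
# NA  2  2  2  2  2 NA NA NA NA

(id 3's candidate stop is absorbed into stop 2 by the greedy rule.)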

At this point, this is the only way I've been able to guarantee accuracy. Everything else I've tried seemed correct, only to turn out not to behave identically to this implementation. But this is, as you might imagine, horribly inefficient, and it's being run 116,000 times in my simulation study.

My assumption is that the best way to handle this is with a non-equi join in data.table.
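
For what it's worth, here is a minimal sketch of that idea on the toy data (untested beyond this example, and the join itself is still quadratic; pairs, running_max and is_stop are names introduced here). It pairs each point with all later points, tracks the running maximum distance, and tests it at the first point at least 180 seconds on:

# Self-join each point (a) to itself and all later points (b)
pairs <- data[data,
              .(a = i.id, b = x.id,
                dist = abs(x.latlon - i.latlon),
                timediff = x.time - i.time),
              on = .(id >= id)]
setorder(pairs, a, b)

# Running maximum distance from a; a is a stop if that maximum is still
# <= 80 at the first b at least 180 seconds later
pairs[, running_max := cummax(dist), by = a]
pairs[timediff >= 180, .(is_stop = running_max[1] <= 80), by = a]
# a = 2 and a = 3 come out TRUE; ids 7:10 drop out entirely because
# nothing occurs at least 180 seconds after them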

The other implementation I'm currently running works better when the number of rows in the data set makes the array too memory-heavy. I won't translate this to work on the toy data, but it's here in case it gives anyone ideas. I've put it in a while loop so it can skip over iterations when it has already assigned a stop_id to a number of points. If points 1:7 all belong to stop_id 1, they aren't considered candidate stops themselves; we just move on to testing again at point 8. It technically returns a different solution, but stops that are "close enough" are merged later in this process, so the final result is unlikely to differ much.

For loop, no matrix

# Requires geodist::geodist_vec(); assumes dt has columns id, longitude,
# latitude, timestamp and time_window (presumably timestamp + 180 seconds)
stopFinder <- function(dt){
  
  nrows <- nrow(dt)
  
  if (nrows < 20000){
    return(quickStopFinder(dt))
  }
  i <- 1
  remove_indices <- 0
  while (i < nrows) {
    this_ends  <- dt[!remove_indices,
                     Position(
                       geodist_vec(rep(longitude[1], .N),
                                   rep(latitude[1], .N),
                                   longitude,
                                   latitude,
                                   paired = TRUE),
                       f = function(x) x > 80,
                       nomatch = .N + 1) ] + i - 1
    # A) Do some number of points occur within the distance?
    # B) If so, is it at least three minutes out?
    if (this_ends > (i + 1) && dt[c(i, (this_ends - 1)), timestamp[.N] > time_window[1]]) {
      # Last index is the one before the distance is broken
      last_index_of_stop <- this_ends - 1
      
      # Next run, we will remove all prior considerations
      remove_indices <- c(1:last_index_of_stop)
      
      # Set the point itself
      dt[i,
         `:=`(candidate_stop = TRUE,
              stop_id = id,
              within_stop = TRUE)]
      # Set the attached points
      dt[(i + 1):last_index_of_stop,
         `:=`(within_stop = TRUE,
              stop_id = i)]
      
      # Start iterating again on the point that broke the distance
      i <- this_ends
    } else {
      # If no stop, move on and leave out this point
      remove_indices <- c(1:i)
      i <- i + 1
    }
  }
  dt[]
}
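
Hypothetical usage (a sketch; it assumes the columns noted in the comment above, and copies so the by-reference updates don't touch the original):

library(geodist)

sm[, time_window := timestamp + 180]
result <- stopFinder(copy(sm))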

quickStopFinder is more or less the implementation I shared at the beginning, which is memory-intensive and slow, but slightly less slow than stopFinder.

Previously, I had something like this as the basis, but it required a lot of subsequent steps and didn't always give me the results I was looking for; I'll add it for posterity.

  # Non-equi self-join: timestamp_dup is a copy of timestamp, and time_window
  # is (presumably) timestamp + 180, so each point is paired with every point
  # inside its window
  res <- dt[dt,
            on = .(timestamp >= timestamp_dup,
                   timestamp <= time_window)]
  res[, dist := geodist_vec(x1 = longitude,
                            y1 = latitude,
                            x2 = i.longitude,
                            y2 = i.latitude,
                            paired = TRUE,
                            measure = "haversine")]
  # A candidate stop requires every distance within the window to be <= 80
  res[, candidate_stop := all(dist <= 80), i.id]
  

New with real data

Edit with example from real data:

This handles the situation with joins, but grows too big too quickly. It is fast when the data are small.

library(geodist)

sm2 <- read.csv(file = "http://daniellemc.cool/sm.csv", row.names = NULL)
sm <- copy(sm2)
setDT(sm)
sm <- sm[, .(timestamp, longitude, latitude, id)]
sm[, timestamp := as.POSIXct(timestamp)]
sm[, id2 := id]  # duplicate the join column so both sides survive the join

# This is problematic on my data because of how quickly it grows.
test <- sm[sm, on = .(id >= id)]
test[, i.id2 := NULL]
setnames(test, c("time.2", "longitude.2", "latitude.2", "id.1",
                 "id.2", "time.1", "longitude.1", "latitude.1"))


# Time and distance differences calculated between each pair
test[, distdiff := geodist_vec(longitude.1, latitude.1,
                               longitude.2, latitude.2,
                               paired = TRUE)]
test[, timediff := as.numeric(time.2 - time.1, units = "secs")]  # force seconds

# Include the next distance to make sure there's at least one point within
# distance at or beyond the 180-second mark
test[, nextdistdiff := shift(distdiff, -1), id.1]

# Are all distances within 180 sec within 80, and is the next following also < 80
test[, dist_met := FALSE]
test[timediff < 180, dist_met := all(distdiff < 80 & nextdistdiff < 80), id.1]
test[, dist_met := any(dist_met), id.1]

# Test how many occur consecutively 
# This keeps us from having > 80 dist but then coming back within 80
test[, consecutive := FALSE]
test[distdiff < 80, consecutive :=  c(TRUE, cummin(diff(id.2) == 1) == 1), id.1]
test[consecutive == TRUE & dist_met == TRUE, stop_id := min(id.1), id.2]
test[test[consecutive == TRUE & dist_met == TRUE], stop_id := i.stop_id, on = .(id.1 = id.2)]
test <- unique(test[, .(stop_id, id.1)])

# Join it back to the data.
sm[test, stop_id := stop_id, on = .(id = id.1)]

Danielle McCool
1 Answer


Using the non-equi join capabilities of data.table, you can join the data to itself while avoiding a cartesian product, which would be too expensive.

As data.table joins only allow simple comparisons (>, < or =), the join is first done on rectangular bounding boxes, and the appropriate distances are filtered out afterwards. On the real data you provided, this means about 7 times fewer distance calculations.

library(data.table)
library(geosphere)

data <- copy(sm)

minduration <- 180
maxdistance <- 80

# Bounding box: the latitudes / longitudes reached by travelling maxdistance
# meters south, north, west and east of each point
data[, latmin := destPoint(cbind(longitude, latitude), b = 180, d = maxdistance)[, 2]]
data[, latmax := destPoint(cbind(longitude, latitude), b = 0,   d = maxdistance)[, 2]]
data[, lonmin := destPoint(cbind(longitude, latitude), b = 270, d = maxdistance)[, 1]]
data[, lonmax := destPoint(cbind(longitude, latitude), b = 90,  d = maxdistance)[, 1]]

data[, timestampmin := timestamp + minduration]

# Cross product with space and time windows, filtered on true distance
cross <- data[data, .(i.id, x.id, i.latitude, i.longitude, x.latitude, x.longitude,
                      dist = distGeo(cbind(x.longitude, x.latitude),
                                     cbind(i.longitude, i.latitude)),
                      i.timestamp, x.timestamp),
              on = .(timestamp > timestampmin,
                     longitude >= lonmin,
                     longitude <= lonmax,
                     latitude >= latmin,
                     latitude <= latmax)][
              dist < maxdistance]

# Summarizing the results:
# - keep counts how far a run of consecutive ids extends past each i.id
# - starts that fall inside another stop's run are dropped (greedy)
# - stops sharing the same end point are merged onto the smallest start id
cross[, .(keep = cumsum(fifelse(diff(x.id - i.id) == 1, 1, NA_integer_))), by = i.id][
      !is.na(keep), .(startid = i.id, nextid = i.id + keep)][
      !(startid %in% nextid)][
      , .(maxid = max(nextid)), by = startid][
      , .(stopid = min(startid)), by = maxid]

     maxid stopid
  1:     6      1
  2:    18     10
  3:    26     22
  4:    33     28
  5:    48     40
 ---             
162:  4273   4269
163:  4276   4274
164:  4295   4294
165:  4303   4301
166:  4306   4305
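
Each row maps a stop's last id (maxid) back to its first (stopid). As a hypothetical follow-up (not part of the original answer; stops is a name introduced here for the table above), the stop ids could be written back onto the points with a non-equi update join:

# stops: the (maxid, stopid) result above, assigned to a variable
data[stops, stop_id := i.stopid, on = .(id >= stopid, id <= maxid)]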
Waldi
  • Yes, I think this is a case where collapsing latlon into a single variable for simplicity's sake is problematic. Of course, when you have a coordinate pair of latitude and longitude, you can't "window" a radius like you can with time. I could window lat and lon independently, but that would give me a square. I'll update my question with real data. – Danielle McCool Apr 26 '21 at 10:30
  • Looking forward to the real data: windowing should also be possible on x/y data – Waldi Apr 26 '21 at 11:05
  • I've updated with real data. Unfortunately, it makes it less predictable, but I don't seem to be able to come up with a good balance because of the nature of this type of data. – Danielle McCool Apr 26 '21 at 15:35
  • Let me know if my edit goes in the right direction. The number of detections seems important, but I think they respect the rules you gave. – Waldi Apr 26 '21 at 18:36
  • lat/lon min/max form a square (https://imgur.com/JRNsxFI), but in theory, since a square contains the circle, it could reduce the number I'm checking. You check true distance regardless. I can't really tell what the stop output is meant to be. If it's B.min, the stops aren't greedy. If id1 forms a potential stop with id2 and id3, and id2 forms a potential stop with id3, id1 grabs 2 and id2 grabs 3, which keeps the stops from aggregating sufficiently. Also, B.all suffers from the same problem that it can get prohibitively large, because it joins before checking conditions, which I need to avoid. – Danielle McCool Apr 26 '21 at 19:09
  • Stop resolution with the 80 meters and 180 seconds parameters on the data returns 353 stops with stopFinder and 356 stops with the more explicit algorithm pasted above. – Danielle McCool Apr 26 '21 at 19:20
  • See my update. You're right, this is not yet greedy, I'll check that. – Waldi Apr 26 '21 at 19:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231625/discussion-between-danielle-mccool-and-waldi). – Danielle McCool Apr 26 '21 at 20:24
  • @DanielleMcCool, online in chat – Waldi Apr 26 '21 at 20:48