Edit: Real data set available here
With thanks to
Wang, Rui, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor, Xia Zhou, Dror Ben-Zeev, and Andrew T. Campbell. "StudentLife: Assessing Mental Health, Academic Performance and Behavioral Trends of College Students using Smartphones." In Proceedings of the ACM Conference on Ubiquitous Computing. 2014.
Explanation
I'm running a simulation study in which I am performing stop detection on location data (lat/lon coordinates) based on relatively simple criteria.
A location (A) is a stop if there exists another location (B) with a timestamp of at least 180 seconds after A, and if all locations between A and B inclusive have a distance from A less than or equal to 80 meters.
I've tried to reduce the data such that it still works but doesn't require actual coordinates.
library(data.table)

data <- data.table(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                   latlon = c(0, 50, 80, 90, 90, 100, 190, 110, 110, 110),
                   time = c(0, 60, 120, 180, 240, 300, 360, 420, 480, 520))
id 1 isn't a stop, because the first location with a difference in time > 180 (id 5) has a distance in latlon of 90.

id 2 is a stop, because all locations between itself and the first location with a difference in time > 180 (id 6) have a distance less than 80 (0, 30, 40, 40, 50).

id 6 is not a stop because, even though id 10 is > 180 away in time, id 7, which falls between them, has a distance greater than 80.

id 8 is not a stop because there is no location at least 180 seconds after it.

Ultimately, I need to be able to assign stop ids greedily: if I find, for example, that id 2 has points satisfying the distance requirement through id 7, then locations with ids 2:7 all get a stop id of 2.
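To make the criterion concrete, here is a naive double-scan sketch of the rule above on the toy data. This is only a readability reference, not the fast version I'm after; it treats the one-column latlon as a stand-in for real geodesic distance, and the function name `assignStops` is mine:

```r
library(data.table)

# Naive reference sketch of the stop criterion: a stop starts at point i
# if the run of points within 80 m of i spans at least 180 s; the whole
# run then shares stop_id = i, assigned greedily left to right.
assignStops <- function(dt, dist_max = 80, time_min = 180) {
  dt <- copy(dt)
  dt[, stop_id := NA_integer_]
  n <- nrow(dt)
  i <- 1L
  while (i <= n) {
    j <- i
    # extend the window while the next point stays within dist_max of i
    while (j < n && abs(dt$latlon[j + 1L] - dt$latlon[i]) <= dist_max) {
      j <- j + 1L
    }
    if (dt$time[j] - dt$time[i] >= time_min) {
      dt[i:j, stop_id := i]   # greedy: the whole window takes i's id
      i <- j + 1L
    } else {
      i <- i + 1L
    }
  }
  dt[]
}
```

On the toy data this marks ids 2:6 with stop_id 2 and leaves every other row NA.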
Matrix and for loop
If I run this:
nrows <- nrow(data)

# Pairwise differences; blank the upper triangle so column j only sees
# rows j..n, i.e. points at or after j
latlon_dist <- outer(data$latlon, data$latlon, `-`)
latlon_dist[upper.tri(latlon_dist)] <- NA
time_window <- outer(data$time, data$time, `-`)
time_window[upper.tri(time_window)] <- NA

foo <- function(x) {
  # first row breaking the distance vs. first row exceeding the time window
  mindist <- min(which(x[, 1] > 80), nrows)
  if (mindist >= min(which(x[, 2] > 180), nrows + 1)) mindist else NA
}

bar <- array(c(latlon_dist, time_window),
             dim = c(nrows, nrows, 2))
apply(bar, 2, foo)
It gives me back the thresholds (NA 7 7 NA NA NA NA NA NA NA), which I can use in a for loop to set the stop id as appropriate.

thresholds <- apply(bar, 2, foo) - 1
previous_threshold <- 0
for (i in seq_along(thresholds)) {
  current_threshold <- thresholds[i]
  if (!is.na(current_threshold) && current_threshold > previous_threshold) {
    data[i:current_threshold, stop_id := i]
    previous_threshold <- current_threshold
  }
}
At this point, this is the only way I've been able to guarantee accuracy. Everything else I've tried seemed correct at first, only to turn out not to behave identically to this implementation. But it is, as you might imagine, horribly inefficient, and it's being run 116,000 times in my simulation study.
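For what it's worth, the two n x n outer matrices can be avoided entirely: the same thresholds fall out of a per-row forward scan. The sketch below is mine (the name `fooLinear` is hypothetical), uses `abs()` on the toy latlon column as a stand-in for real geodesic distance, and mirrors foo's defaults exactly:

```r
library(data.table)

# Per-row forward scan reproducing apply(bar, 2, foo) without building
# the n x n distance and time matrices.
fooLinear <- function(dt, dist_max = 80, time_min = 180) {
  n <- nrow(dt)
  vapply(seq_len(n), function(i) {
    d  <- abs(dt$latlon[i:n] - dt$latlon[i])
    tw <- dt$time[i:n] - dt$time[i]
    # first index breaking the distance (default n, as in foo)
    mindist <- min(which(d > dist_max), n - i + 1L) + i - 1L
    # first index exceeding the time window (default n + 1)
    mintime <- min(which(tw > time_min), n - i + 2L) + i - 1L
    if (mindist >= mintime) mindist else NA_integer_
  }, integer(1))
}
```

On the toy data this returns the same NA 7 7 NA NA NA NA NA NA NA as the array version, in O(n * window) time and O(n) memory.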
My assumption is that the best way to handle this is with a non-equi join in data.table.
The other implementation I'm currently running performs better when the number of rows in the data set makes the array too memory-heavy. I won't translate it to work on the example data, but it's here in case it gives anyone ideas. I've put it in a while loop so it can skip iterations once a stop_id has already been assigned to a run of points: if points 1:7 all belong to stop_id 1, they aren't considered as candidate stops themselves, and testing resumes at point 8. It technically returns a different solution, but stops that are "close enough" are merged later in the process, so the final result is unlikely to differ much.
For loop, no matrix
library(geodist)  # for geodist_vec()

stopFinder <- function(dt){
  nrows <- nrow(dt)
  if (nrows < 20000){
    return(quickStopFinder(dt))
  }
  i <- 1
  remove_indices <- 0
  while (i < nrows) {
    # First remaining index whose distance from point i exceeds 80 m
    this_ends <- dt[!remove_indices,
                    Position(function(x) x > 80,
                             geodist_vec(rep(longitude[1], .N),
                                         rep(latitude[1], .N),
                                         longitude,
                                         latitude,
                                         paired = TRUE),
                             nomatch = .N + 1)] + i - 1
    # A) Do some number of points occur within the distance?
    # B) If so, is it at least three minutes out?
    #    (time_window is a precomputed column, presumably timestamp + 180)
    if (this_ends > (i + 1) && dt[c(i, (this_ends - 1)), timestamp[.N] > time_window[1]]) {
      # Last index is the one before the distance is broken
      last_index_of_stop <- this_ends - 1
      # Next run, we will remove all prior considerations
      remove_indices <- c(1:last_index_of_stop)
      # Set the point itself
      dt[i,
         `:=`(candidate_stop = TRUE,
              stop_id = id,
              within_stop = TRUE)]
      # Set the attached points
      dt[(i + 1):last_index_of_stop,
         `:=`(within_stop = TRUE,
              stop_id = i)]
      # Start iterating again on the point that broke the distance
      i <- this_ends
    } else {
      # If no stop, move on and leave out this point
      remove_indices <- c(1:i)
      i <- i + 1
    }
  }
  dt[]
}
quickStopFinder is more or less the implementation I shared at the beginning: memory-intensive and slow, but slightly less slow than stopFinder.

Previously, I had something like the following as the basis. It required a lot of subsequent steps and didn't always give me the results I was looking for, but I'll add it for posterity.
# timestamp_dup is a copy of the timestamp column and time_window is
# presumably timestamp + 180, both precomputed before the join (non-equi
# joins rename the join columns, hence the duplicate)
res <- dt[dt,
          on = .(timestamp >= timestamp_dup,
                 timestamp <= time_window)]
res[, dist := geodist_vec(x1 = longitude,
                          y1 = latitude,
                          x2 = i.longitude,
                          y2 = i.latitude,
                          paired = TRUE,
                          measure = "haversine")]
res[, candidate_stop := all(dist <= 80), i.id]
New with real data
Edit with example from real data:
This handles the situation with joins, but grows too big too quickly. It is fast when the data are small.
sm2 <- read.csv(file = "http://daniellemc.cool/sm.csv", row.names = NULL)
sm <- copy(sm2)
setDT(sm)
sm <- sm[, .(timestamp, longitude, latitude, id)]
sm[, timestamp := as.POSIXct(timestamp)]
sm[, id2 := id]
# This is problematic on my data because of how quickly it grows.
test <- sm[sm, on = .(id >= id)]
test[, i.id2 := NULL]
setnames(test, c("time.2", "longitude.2", "latitude.2", "id.1",
"id.2", "time.1", "longitude.1", "latitude.1"))
# Time and distance differences calculated between each pair
test[, distdiff := geodist_vec(longitude.1, latitude.1,
longitude.2, latitude.2,
paired = TRUE)]
# `-` on POSIXct chooses its units automatically; force seconds so the
# timediff < 180 filter below really means 180 seconds
test[, timediff := as.numeric(time.2 - time.1, units = "secs")]
# Include the next distance to make sure there's at least one within distance and
# over 180 timediff.
test[, nextdistdiff := shift(distdiff, -1), id.1]
# Are all distances within 180 sec within 80, and is the next following also < 80
test[, dist_met := FALSE]
test[timediff < 180, dist_met := all(distdiff < 80 & nextdistdiff < 80), id.1]
test[, dist_met := any(dist_met), id.1]
# Test how many occur consecutively
# This keeps us from having > 80 dist but then coming back within 80
test[, consecutive := FALSE]
test[distdiff < 80, consecutive := c(TRUE, cummin(diff(id.2) == 1) == 1), id.1]
test[consecutive == TRUE & dist_met == TRUE, stop_id := min(id.1), id.2]
test[test[consecutive == TRUE & dist_met == TRUE], stop_id := i.stop_id, on = .(id.1 = id.2)]
test <- unique(test[, .(stop_id, id.1)])
# Join it back to the data.
sm[test, stop_id := stop_id, on = .(id = id.1)]