Unable to perform rolling join on time and another variable in data.table

Question

I have two data.tables - one with location snapshots (df2), and one with times when rides ended (df1). My goal is to join the very large df2 on df1 where the times match closest and the bike_ids match. I have tried everything in data.table but nothing seems to work.

I am following this example almost verbatim to merge by bike_id and by nearest time, but I end up with empty columns.

sum(is.na(df1$time))
# [1] 0
sum(is.na(df2$time))
# [1] 0

sapply(df1, class)
# $time
# [1] "POSIXct" "POSIXt" 

# $bike_id
# [1] "integer"

sapply(df2, class)
# $time
# [1] "POSIXct" "POSIXt" 

# $bike_id
# [1] "integer"

# $st_x
# [1] "numeric"

# $st_y
# [1] "numeric"

all(df1$bike_id %in% df2$bike_id)
# [1] TRUE

Neither of the tables have NA in time value, and both contain the same bike_ids.

For example:

head(df1)
#                   time bike_id
# 1: 2017-12-20 07:29:22   16465
# 2: 2017-12-20 15:34:36   16465
# 3: 2017-12-09 22:08:56   16468
# 4: 2017-12-10 12:38:53   16468
# 5: 2017-12-18 00:19:18   16468
# 6: 2017-12-18 17:29:18   16468


head(df2)
#                   time bike_id      st_x     st_y
# 1: 2017-12-04 16:21:31   16465 -90.07687 29.95264
# 2: 2017-12-04 16:30:05   16465 -90.07687 29.95264
# 3: 2017-12-04 16:40:05   16465 -90.07687 29.95264
# 4: 2017-12-04 16:50:05   16465 -90.07687 29.95264
# 5: 2017-12-04 17:00:06   16465 -90.07687 29.95264
# 6: 2017-12-04 17:10:06   16465 -90.07687 29.95264

The code I am using to match by bike_id and by nearest time is:

setkey(df2, time, bike_id)[, dateMatch:=time]
test <- df2[df1, roll = 'nearest']


head(test)
#                   time bike_id st_x st_y dateMatch
# 1: 2017-12-20 07:29:22   16465   NA   NA      <NA>
# 2: 2017-12-20 15:34:36   16465   NA   NA      <NA>
# 3: 2017-12-09 22:08:56   16468   NA   NA      <NA>
# 4: 2017-12-10 12:38:53   16468   NA   NA      <NA>
# 5: 2017-12-18 00:19:18   16468   NA   NA      <NA>
# 6: 2017-12-18 17:29:18   16468   NA   NA      <NA>

Any ideas as to what could possibly be going wrong? Substituting roll=Inf produces the same result with columns of NAs.

Wouldn't roll='nearest' find at least some date if bike_ids are the same in both tables and time is never NA?

could you add `dput(head(df1))` and `dput(head(df2))` to your post — MichaelChirico, Jan 16 '18 at 00:57

score 0 · Accepted Answer · answered Jan 16 '18 at 04:19

0

This seemed to solve the problem

test = df2[df1, on=.(bike_id, time = time), roll="nearest", 
            .(bike_id, time, st_x, st_y)]

answered Jan 16 '18 at 04:19

iskandarblue

7,208
15
60
130

Unable to perform rolling join on time and another variable in data.table

1 Answers1