2

I have two dataframes (GPSData and ACCData), they are coming from the same dataframe but I used a fonction for only ACCData. Now, I would like to join them. But somethimes, I don't have the same numbers of rows so I can't join them and be sure time lines correspond.

This GPSData df :

device_id         UTC_datetime    UTC_date  UTC_time   datatype 

182197  2020-02-19 13:53:03  2020-02-19  13:53:03      GPS       
182197  2020-02-19 13:53:28  2020-02-19  13:53:28      GPS        
182197  2020-02-19 13:53:59  2020-02-19  13:53:59      GPS       
182197  2020-02-19 13:54:30  2020-02-19  13:54:30      GPS       
182197  2020-02-19 13:54:58  2020-02-19  13:54:58      GPS      
182197  2020-02-19 13:55:28  2020-02-19  13:55:28      GPS     

This is ACCData df :

 start.timestamp  burst.id  axis.sample  burst.duration
2020-02-19 13:53:04         0          200              9  
2020-02-19 13:53:29         1          200              9  
2020-02-19 13:54:00         2          200              9  
2020-02-19 13:54:31         3          200              9  
2020-02-19 13:54:59         4          200              9  
2020-02-19 13:55:29         5          200              9  

I would like a code line who says that in GPSData keep only lines who correspond to a burst of ACCData, who have a UTC_Datetime close to start.timesstamp in ACCData, most of the time the burst of ACCData start 1 or 2 seconds after the GPS position. Hope you can help me, it will help a lot in my study. Thanks !

  • Does this answer your question? [Merging Data on date time column (POSIXct format)](https://stackoverflow.com/questions/57932570/merging-data-on-date-time-column-posixct-format) – Scipione Sarlo Apr 01 '20 at 15:17
  • This is not a duplicate of the above question, because it needs a rolling join. It is sort of a duplicate of [Merging two sets of data by data.table roll='nearest' function](https://stackoverflow.com/questions/54013468/merging-two-sets-of-data-by-data-table-roll-nearest-function). – Ian Campbell Apr 01 '20 at 15:43

1 Answers1

1

This is a seemingly hard problem, but it is easily achieved using a rolling join in data.table.

library(data.table)
library(fasttime)
setDT(GPSData)
setDT(ACCData)
GPSData[,Time := fastPOSIXct(UTC_datetime)]
ACCData[,Time := fastPOSIXct(start.timestamp)]
Result <- GPSData[ACCData,on = "Time", roll="nearest"]
Result[,.(device_id,UTC_datetime,start.timestamp,burst.id,axis.sample,burst.duration)]
#   device_id        UTC_datetime     start.timestamp burst.id axis.sample burst.duration
#1:    182197 2020-02-19 13:53:03 2020-02-19 13:53:04        0         200              9
#2:    182197 2020-02-19 13:53:28 2020-02-19 13:53:29        1         200              9
#3:    182197 2020-02-19 13:53:59 2020-02-19 13:54:00        2         200              9
#4:    182197 2020-02-19 13:54:30 2020-02-19 13:54:31        3         200              9
#5:    182197 2020-02-19 13:54:58 2020-02-19 13:54:59        4         200              9
#6:    182197 2020-02-19 13:55:28 2020-02-19 13:55:29        5         200              9

After the join you can compute the difference in UTC_datetime and start.timestamp to filter out rows that are too far apart.

Result[,`:=`(UTC_date = NULL, UTC_time = NULL, datatype = NULL, Time = NULL, UTC_datetime = fastPOSIXct(UTC_datetime), start.timestamp = fastPOSIXct(start.timestamp))]
Result[,Diff := abs(UTC_datetime - start.timestamp)]
Result[Diff < 5,]
#   device_id        UTC_datetime     start.timestamp burst.id axis.sample burst.duration   Diff
#1:    182197 2020-02-19 08:53:03 2020-02-19 08:53:04        0         200              9 1 secs
#2:    182197 2020-02-19 08:53:28 2020-02-19 08:53:29        1         200              9 1 secs
#3:    182197 2020-02-19 08:53:59 2020-02-19 08:54:00        2         200              9 1 secs
#4:    182197 2020-02-19 08:54:30 2020-02-19 08:54:31        3         200              9 1 secs
#5:    182197 2020-02-19 08:54:58 2020-02-19 08:54:59        4         200              9 1 secs
#6:    182197 2020-02-19 08:55:28 2020-02-19 08:55:29        5         200              9 1 secs

If you are trying to do it by device_id, you can join on multiple columns, but the last value can be rolling.

GPSData[ACCData,on = c("device_id","Time"), roll="nearest"]

Data

GPSData <- structure(list(device_id = c(182197L, 182197L, 182197L, 182197L, 
182197L, 182197L), UTC_datetime = structure(1:6, .Label = c("2020-02-19 13:53:03", 
"2020-02-19 13:53:28", "2020-02-19 13:53:59", "2020-02-19 13:54:30", 
"2020-02-19 13:54:58", "2020-02-19 13:55:28"), class = "factor"), 
    UTC_date = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2020-02-19", class = "factor"), 
    UTC_time = structure(1:6, .Label = c("13:53:03", "13:53:28", 
    "13:53:59", "13:54:30", "13:54:58", "13:55:28"), class = "factor"), 
    datatype = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "GPS", class = "factor"), 
    Time = structure(c(1582120383, 1582120408, 1582120439, 1582120470, 
    1582120498, 1582120528), class = c("POSIXct", "POSIXt"))), class = "data.frame", row.names = c(NA, 
-6L))
ACCData <- structure(list(start.timestamp = structure(1:6, .Label = c("2020-02-19 13:53:04", 
"2020-02-19 13:53:29", "2020-02-19 13:54:00", "2020-02-19 13:54:31", 
"2020-02-19 13:54:59", "2020-02-19 13:55:29"), class = "factor"), 
    burst.id = 0:5, axis.sample = c(200L, 200L, 200L, 200L, 200L, 
    200L), burst.duration = c(9L, 9L, 9L, 9L, 9L, 9L), Time = structure(c(1582120384, 
    1582120409, 1582120440, 1582120471, 1582120499, 1582120529
    ), class = c("POSIXct", "POSIXt"))), class = "data.frame", row.names = c(NA, 
-6L))
Ian Campbell
  • 23,484
  • 14
  • 36
  • 57