3

EDIT 30.09.2019: Added examples and modified code for reproducibility.

EDIT 01.10.2019: Better Explanation.

I'm merging two different datasets of a transportation system. One (df1) is the schedules at which trains should be at a certain station, the other (df2) is the observations of trains passing by those stations.

Since it's common that some schedules can't be fulfilled (a train broke, services need to be removed, construction sites, etc etc), the observations dataset contains fewer entries than the schedules one.

My goal is to measure the delays of the realized schedules, from a passenger perspective. This means: if a passenger knows at what time the train should be departing a station, the delay he sees corresponds to the train leaving at the exact scheduled time (delay = 0) or the immediate next train (delay = depart_observed - depart_scheduled).
The goal is translated into the following task: merge both datasets by:
-stop_id: Since I need to match observations for each station, this is obviously the first matching criteria.
-timestamp: for every schedule timestamp I should find the immediate consecutive observation timestamp.
The result should show all possible schedules (unique, no repetitions) matched with the closest following observation (unique too, no repetition). Again, I need to make sure that the way the merging is performed, only observations that happen AFTER or SIMULTANEOUSLY to the schedule are matched with the schedule's proper timestamp. Given the size difference between both datasets, I'm expecting plenty NAs in the observed column, since the schedules can't be paired with an observation.

For simplicity, the provided examples contain only the two columns I'm using for the join:

stop_id (id for the station) and timestamp (date & time at which the trains should be or were observed at the stations).

The approach I'm using is using a rolling join with Data.Table in R. This works great, except for the fact that whenever I'm creating the join, I keep getting duplicates of one dataset, and I can't have those repetitions in the merged set.

The code I'm using:

#DECLARING FUNCTIONS (Two functions according to df1[df2] and df2[df1])

merge_schedule_obs <- function(df1, df2) {
  setDT(df1)
  setDT(df2)
  max_delay <- 3600 # 1-hour max delay for pairing schedule and occurrence
  setkey(df1, stop_id, departure)[, departScheduled:=departure]
  df3 <- df1[df2, roll = max_delay]
  return(df3)
}

merge_schedule_obs2 <- function(df1, df2) {
  setDT(df1)
  setDT(df2)
  max_delay <- 3600 # 1-hour max delay for pairing schedule and occurrence
  setkey(df1, stop_id, departure)[, departObserved:=departure]
  df3 <- df1[df2, roll = -max_delay]
  return(df3)
}

#CREATING SOME SAMPLE DATA (The real dataset is significantly larger. But this example should cover the problem)


#Sample Schedule Data:

df1 <- distinct(data.table(stop_id = as.factor(c("70015","70009", "70003", "70019","70013", "70007", "70019", "70005", "70007", "70019", "70005")),
                           departure = as.POSIXct(c("2019-09-09 06:57:00","2019-09-09 06:57:00", "2019-09-09 06:57:00",
                                                    "2019-09-09 06:54:30","2019-09-09 06:54:00", "2019-09-09 06:55:00",
                                                    "2019-09-09 06:55:30","2019-09-09 06:55:00", "2019-09-09 06:55:10",
                                                    "2019-09-09  06:55:00", "2019-09-09  06:58:00"))))



Out:

    stop_id           departure
 1:   70015 2019-09-09 06:57:00
 2:   70009 2019-09-09 06:57:00
 3:   70003 2019-09-09 06:57:00
 4:   70019 2019-09-09 06:54:30
 5:   70013 2019-09-09 06:54:00
 6:   70007 2019-09-09 06:55:00
 7:   70019 2019-09-09 06:55:30
 8:   70005 2019-09-09 06:55:00
 9:   70007 2019-09-09 06:55:10
10:   70019 2019-09-09 06:55:00
11:   70005 2019-09-09 06:58:00

#Sample Observations Data:

df2 <- distinct(data.table(stop_id = as.factor(c("70013","70009", "70015", "70005", "70005", "70007", "70019")),
                           departure = as.POSIXct(c("2019-09-09 06:57:10","2019-09-09 07:00:17", "2019-09-09 07:00:12",  "2019-09-09  06:58:20", "2019-09-09  06:58:00",
                                                    "2019-09-09 06:57:30", "2019-09-09 06:57:35")))
)

Out:

   stop_id           departure
1:   70013 2019-09-09 06:57:10
2:   70009 2019-09-09 07:00:17
3:   70015 2019-09-09 07:00:12
4:   70005 2019-09-09 06:58:20
5:   70005 2019-09-09 06:58:00
6:   70007 2019-09-09 06:57:30
7:   70019 2019-09-09 06:57:35
#MERGING DATASETS: (Both directions are covered, and the problem shows in both)

merged_df <- distinct(na.omit(merge_schedule_obs(df1,df2))) 

Out:

   stop_id           departure     departScheduled
1:   70005 2019-09-09 06:58:00 2019-09-09 06:58:00
2:   70005 2019-09-09 06:58:20 2019-09-09 06:58:00
3:   70007 2019-09-09 06:57:30 2019-09-09 06:55:10
4:   70009 2019-09-09 07:00:17 2019-09-09 06:57:00
5:   70013 2019-09-09 06:57:10 2019-09-09 06:54:00
6:   70015 2019-09-09 07:00:12 2019-09-09 06:57:00
7:   70019 2019-09-09 06:57:35 2019-09-09 06:55:30

merged_df2 <- distinct(na.omit(merge_schedule_obs2(df2,df1))) 

Out:
    stop_id           departure      departObserved
 1:   70005 2019-09-09 06:55:00 2019-09-09 06:58:00
 2:   70005 2019-09-09 06:58:00 2019-09-09 06:58:00
 3:   70007 2019-09-09 06:55:00 2019-09-09 06:57:30
 4:   70007 2019-09-09 06:55:10 2019-09-09 06:57:30
 5:   70009 2019-09-09 06:57:00 2019-09-09 07:00:17
 6:   70013 2019-09-09 06:54:00 2019-09-09 06:57:10
 7:   70015 2019-09-09 06:57:00 2019-09-09 07:00:12
 8:   70019 2019-09-09 06:54:30 2019-09-09 06:57:35
 9:   70019 2019-09-09 06:55:00 2019-09-09 06:57:35
10:   70019 2019-09-09 06:55:30 2019-09-09 06:57:35

The result of running the rolling join with the datasets df1 and df2 create a duplicate entry of the schedule in station 70005 (if I run the join the other way around df1[df2, roll = max_delay, ...] the duplicates happen in the observation instead, in stations 70005 and 70019). Basically, I can't get rid of them. I tried the options with mult (first, last) and trying to play with rollends... still always unwanted duplicates...

How can I merge these datasets without any duplicates?

Thank you!

Gonz87
  • 33
  • 3

1 Answers1

2

First of all, you could use unique instead of distinct (the latter presumably from dplyr; you don't specify) to avoid coercing the data table to a data frame.

You were pretty close, but you need to switch the tables in the join, i.e. something like df2[df1], so that the rows from df1 are used as search keys, and then you can use mult to remove duplicates.

Here's one way to do what you want with a non-equi join:

setkey(df1, departure)
setkey(df2, departure)

df1[, max_departure := departure + as.difftime(1, units = "hours")
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure, departure <= max_departure),
                                  mult = "first"]
      ][, max_departure := NULL]

We order by departure (via setkey) so that mult = "first" returns the closest match in the future within what's allowed. The intermediate column max_departure has to be assigned and subsequently removed because non-equi joins can only use existing columns. Also note that the syntax used takes from this answer (the version with .SD instead of df1 doesn't work in this case, I don't know why).


EDIT: based on the comments, it occurs to me that when you say "duplicated", you might be referring to something different. Say you have planned departures at 10 and 10:30, but the one at 10 never takes place, and an observed departure is 10:31. Perhaps you mean that 10:31 is the observed departure for the one scheduled at 10:30, and cannot be used for the one at 10? If that's the case, perhaps this will work:

setkey(df1, departure)
setkey(df2, departure)

max_dep <- function(departure) {
  max_departure <- departure + as.difftime(1, units = "hours")

  next_departure <- shift(departure,
                          fill = max_departure[length(max_departure)] + as.difftime(1, units = "secs"),
                          type = "lead")

  invalid_max <- max_departure >= next_departure

  max_departure[invalid_max] <- next_departure[invalid_max] - as.difftime(1, units = "secs")
  max_departure
}

df1[, max_departure := max_dep(departure), by = "stop_id"
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure, departure <= max_departure),
                                  mult = "first"]
      ][, max_departure := NULL]

The max_dep helper checks, for each stop and scheduled departure, what would be the next scheduled departure, and sets max_departure as "next minus 1 second" if the next departure is within one hour.

The other solution wouldn't work for this because, as long as an observed departure falls within one hour of the scheduled one, it is a valid option. In my example that means 10:31 would be valid for both 10:30 and 10.

Alexis
  • 4,950
  • 1
  • 18
  • 37
  • Niiceee!! Thank you sooo much it worked perfectly! I'm still running some tests, but so far, awesome! – Gonz87 Oct 02 '19 at 12:14
  • Ok, I ran some tests and in the large dataset there are still some repeated values in the observed_departure which I really can't figure out... I don't think it's something reproducible in the small dataset I provided... If there is a way to attach the data, I could show you what I mean... either way, thank you!! – Gonz87 Oct 02 '19 at 12:17
  • I found out what is responsible for the duplication behavior... basically adding a "one hour window" **here: max_departure: = departure + as.difftime(1, units = "hours")** for matching the datasets creates the repetition. The more I reduce this value, the fewer repetitions I have, and the more sparse my dataset looks at the end (which is the correct, expected behavior) – Gonz87 Oct 02 '19 at 12:26
  • @Gonz87 I'm not sure if what you're seeing is a result of `data.table`. The syntax used prevents extra rows from being accepted, because adding by reference to `df1` means that the incoming vector (`x.departure` here) must have the same number of elements as there are rows. Try removing `mult` from the call, it should give you an error. – Alexis Oct 02 '19 at 16:40
  • @Gonz87 I think there might be a misunderstanding here, can you check my edit? – Alexis Oct 02 '19 at 17:08
  • Alexis, thanks for your comment and edit! That's exactly what I meant. I have thought more about the problem, though, and your approach makes way more sense. Since I'm only interested in what the passenger sees. Say departure happens at 10:31 like in your example... then there are two options: If he arrived for the 10:00 schedule, he sees a 31-minute delay. If he arrives for the 10:30 schedule, he sees a 1-minute delay... the duplicates actually make more sense, and my problem is better approached like that. Thanks a lot!!! – Gonz87 Oct 04 '19 at 08:45