1

I would like to calculate the difference between times in R for each observation given that it concerns bus arrivals and timetables. The code i have so far is:

for (i in ida1d$DATA_TRAMA) {
   for (j in horidat$CORD4) {
       if((ida1d$DATA_TRAMA - horidat$CORD4 < ida1d$diff)) {
           ida1d$diff <- ida1d$DATA_TRAMA - horidat$CORD4
       }                    
   }            
}

I have these dataframes:

  • ida1d which has the specific information I want
  • horidat which has the timetables
  • both ida1d$DATA_TRAMA and horidat$CORD4 are in POSIXct %Y-m%-%d %H:%M:%S format.

Now what I was trying to do is to have a column in the ida1d table with the differences. The problem is that for the first few observations it calculates the differences correctly, but from a certain point on it doesn't, the number of observations for horidat (timetable) is 75 and for the ida1d is 88 I think the problem with the calculations could be because of this, thus trying to approach the problem with for loops, but I think i'm missing something...

The date and time for ida1d is the time that the bus arrived at the bus stop throughout the day :

ida1d$DATA_TRAMA
     [1] 2010-10-01 00:00:08 
         2010-10-01 00:29:45 
         2010-10-01 06:22:56            
         2010-10-01 06:38:55 
         2010-10-01 06:52:41 
         2010-10-01 07:05:08 
         2010-10-01 07:15:17 
         2010-10-01 07:25:14
         2010-10-01 07:38:25 
         2010-10-01 07:44:55 
         2010-10-01 07:54:44
         2010-10-01 08:05:05 
         2010-10-01 08:14:43 
         2010-10-01 08:24:11 
         2010-10-01 08:33:29 
         2010-10-01 08:46:26
         2010-10-01 08:54:40 
         2010-10-01 09:04:34
         2010-10-01 09:14:53 

And this is the timetable for the bus (horidat)

horidat$CORD4
 [2] 2010-10-01 00:00:00 
     2010-10-01 00:30:00 
     2010-10-01 06:25:00  
     2010-10-01 06:45:00 
     2010-10-01 07:00:00 
     2010-10-01 07:15:00 
     2010-10-01 07:30:00 
     2010-10-01 07:45:00 
     2010-10-01 07:57:00 
     2010-10-01 08:09:00 
     2010-10-01 08:21:00 
     2010-10-01 08:32:00 
     2010-10-01 08:43:00 
     2010-10-01 08:54:00 
     2010-10-01 09:06:00 
     2010-10-01 09:18:00  

And i can calculate the difference between [1] and [2] as we can see below the results [3]:

ida1d$diff
Time differences in secs
 [3]      8    
         -15   
         -124  
         -365   
         -439   
         -592   
         -883 
         -1186 
         -1115 
         -1445  
         -1576  
         -1615  
         -1697  
         -1789  
         -1951  
         -1894  
         -2120  
         -2246 

As we can see it starts off well in beginning and then it gets wrong the rest of the calculations and that is my problem, and i wanted to get the smallest value of the difference to be in the column for each observation, perhaps it uses other values and thus the wrong calculations i guess...

Hanna
  • 51
  • 7
  • 1
    Possible duplicate of [How to make time difference in same units when subtracting POSIXct](https://stackoverflow.com/questions/30510044/how-to-make-time-difference-in-same-units-when-subtracting-posixct) – divibisan Jan 23 '19 at 20:46
  • 1
    How do you want the 88 times in `ida1d$data` to relate to the 75 times in `horidat$CORD4`? Does observation # x in one always related to observation # x in the other? If not, you probably need to do some sort of join first between the two tables. It would be helpful to see an example of a few rows of each table and how you want them to relate. – Jon Spring Jan 23 '19 at 21:01
  • You have your iterators `i` and `j` but never seem to use them. So your `if` statement isn't actually inside a loop and is just operating on the vectors. –  Jan 23 '19 at 21:43
  • @JonSpring, already added more details – Hanna Jan 23 '19 at 22:12
  • Do you want to compare each arrival time to the closest scheduled time, or the most recent scheduled time, or the same # stop (as currently set up), or something else? – Jon Spring Jan 23 '19 at 22:33
  • @JonSpring i want to compare each time arrival with each scheduled time – Hanna Jan 23 '19 at 22:36
  • Does that mean the 9:14:53 arrival (19th of the day) should be compared to the 9:18 scheduled stop (16th of the day), since they have the smallest difference? – Jon Spring Jan 23 '19 at 22:39
  • @JonSpring, yes, the problem is that the calculation for those times gives the value of 2246 seconds which is like 30 min :/ it should be 4 min or so and i wanted to do the same thing for each record – Hanna Jan 23 '19 at 22:44

2 Answers2

2

I was going to comment but don't have the reputation. Looking for a few points.

It looks like one table has events and another has some kind of attributes? If you could provide detail on what's in each table that'd help.

In any case I'll sketch out roughly how I might approach it. Generally you want to avoid for-loops. You can use joins or vector operations which will be more performant. The solution seems like it'll be a join between the two dataframes with a mutate to create the difference column. Let me know otherwise if this is not the case!

# install these if you don't have them
library(dplyr)
library(magrittr)

horidat %>%
  xxxx_join(ida1d, on=c('some_column1','some_column2') %>%
  mutate(diff = as.numeric(difftime(DATA_TRAMA, CORD4, units='secs')))

xxxx_join is the appropriate join e.g. left_join.

functional
  • 21
  • 3
2

Here's an approach using the fuzzyjoin package. There are faster approaches using data.table, like here and here, but for me this approach is easier to follow and makes it easier to tweak the matching rule, if necessary.

fuzzyjoin::difference_left_join works like merge in base R or left_join in dplyr, creating a database-style "join" between two data sets. In this case, it connects the respective time stamps in your two tables. As a "left" join, it keeps at least one copy of each row of the original arrivals data frame, and includes one row for each matching CORD4 scheduled stops that are close enough in time to that DATA_TRAMA time. In this case, by setting max_dist = 15*60, we get all scheduled stops within 15 minutes (15 min x 60 sec) of the arrival time. This is probably more matches than you want (41 close-enough matches for 19 arrival times), but it leaves you some flexibility to decide which scheduled stops should be tied to which actual arrivals.

In this case, to pick the scheduled stop closest in time, I use group_by and top_n from dplyr to keep only the row with the smallest absolute time difference for each arrival.

edit: added signed difference and limited join to one column

library(dplyr); library(fuzzyjoin)

ida1d %>% 
  # select() to bring in only CORD4 from horidat
  difference_left_join(horidat %>% select(CORD4), 
                   by = c(DATA_TRAMA = "CORD4"),
                   max_dist = 15*60, distance_col = "abs_dif") %>%
  # difference_left_join based on absolute differences
  # add signed difference
  mutate(difference = DATA_TRAMA - CORD4) %>%
  # could use filter like this to limit to only late buses:
  # filter(difference >= 0) %>%
  group_by(DATA_TRAMA) %>%
  top_n(-1, wt = difference) %>%
  ungroup()

# A tibble: 19 x 4
   DATA_TRAMA          CORD4               abs_dif  difference
   <dttm>              <dttm>              <time>   <time>    
 1 2010-10-01 00:00:08 2010-10-01 00:00:00   8 secs    8 secs 
 2 2010-10-01 00:29:45 2010-10-01 00:30:00  15 secs  -15 secs 
 3 2010-10-01 06:22:56 2010-10-01 06:25:00 124 secs -124 secs 
 4 2010-10-01 06:38:55 2010-10-01 06:45:00 365 secs -365 secs 
 5 2010-10-01 06:52:41 2010-10-01 07:00:00 439 secs -439 secs 
 6 2010-10-01 07:05:08 2010-10-01 07:00:00 308 secs  308 secs 
 7 2010-10-01 07:15:17 2010-10-01 07:15:00  17 secs   17 secs 
 8 2010-10-01 07:25:14 2010-10-01 07:30:00 286 secs -286 secs 
 9 2010-10-01 07:38:25 2010-10-01 07:45:00 395 secs -395 secs 
10 2010-10-01 07:44:55 2010-10-01 07:45:00   5 secs   -5 secs 
11 2010-10-01 07:54:44 2010-10-01 07:57:00 136 secs -136 secs 
12 2010-10-01 08:05:05 2010-10-01 08:09:00 235 secs -235 secs 
13 2010-10-01 08:14:43 2010-10-01 08:09:00 343 secs  343 secs 
14 2010-10-01 08:24:11 2010-10-01 08:21:00 191 secs  191 secs 
15 2010-10-01 08:33:29 2010-10-01 08:32:00  89 secs   89 secs 
16 2010-10-01 08:46:26 2010-10-01 08:43:00 206 secs  206 secs 
17 2010-10-01 08:54:40 2010-10-01 08:54:00  40 secs   40 secs 
18 2010-10-01 09:04:34 2010-10-01 09:06:00  86 secs  -86 secs 
19 2010-10-01 09:14:53 2010-10-01 09:18:00 187 secs -187 secs

Sample data:

ida1d = data.frame(DATA_TRAMA = as.POSIXct(
  c(
    "2010-10-01 00:00:08",
    "2010-10-01 00:29:45",
    "2010-10-01 06:22:56",
    "2010-10-01 06:38:55",
    "2010-10-01 06:52:41",
    "2010-10-01 07:05:08",
    "2010-10-01 07:15:17",
    "2010-10-01 07:25:14",
    "2010-10-01 07:38:25",
    "2010-10-01 07:44:55",
    "2010-10-01 07:54:44",
    "2010-10-01 08:05:05",
    "2010-10-01 08:14:43",
    "2010-10-01 08:24:11",
    "2010-10-01 08:33:29",
    "2010-10-01 08:46:26",
    "2010-10-01 08:54:40",
    "2010-10-01 09:04:34",
    "2010-10-01 09:14:53"
  )
))

horidat = data.frame(CORD4 = as.POSIXct(
  c(
    "2010-10-01 00:00:00",
    "2010-10-01 00:30:00",
    "2010-10-01 06:25:00",
    "2010-10-01 06:45:00",
    "2010-10-01 07:00:00",
    "2010-10-01 07:15:00",
    "2010-10-01 07:30:00",
    "2010-10-01 07:45:00",
    "2010-10-01 07:57:00",
    "2010-10-01 08:09:00",
    "2010-10-01 08:21:00",
    "2010-10-01 08:32:00",
    "2010-10-01 08:43:00",
    "2010-10-01 08:54:00",
    "2010-10-01 09:06:00",
    "2010-10-01 09:18:00"
  )
))
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
  • hi @JonSpring, thank you very much!! it almost works, since i have more columns in both data frames it joins all the columns in one data frame, is it possible to just have a column with the difference in ida1d? And the times are almost correct as well but there are certain cases where the bus is ahead of the schedule, is there a way to make that distinction, like the delay is positive and being ahead is negative? thank you very much!! – Hanna Jan 24 '19 at 12:11
  • Updated my answer. Yes, you can specify which columns are brought into the join, done here with `select`. The join here brings in everything within the specific absolute difference. I added a column after that shows the relative difference, with positive numbers for late buses and negative for early buses. You can use those two numbers to filter and rank to get the match that is most appropriate to your application. – Jon Spring Jan 24 '19 at 20:36
  • hi @Jon Spring, once again thank you very much!! This seems to be working so far and does what i was thinking of! Thank you!!! – Hanna Jan 25 '19 at 10:38