
I am having trouble working out the logic to make this work, and I couldn't find anything on Stack Overflow or elsewhere on the web for this specific problem.

I have two data frames:

Data frame one:

ID  Date         Time 
1   2017-11-13   06:34:50
2   2017-11-13   06:40:10
3   2017-11-14   23:58:10 

Data frame two:

Number_Visitors   hit_time 
 20               2017-11-13 06:34:50 
 18               2017-11-13 06:34:50
 15               2017-11-15 00:06:10
 25               2018-12-14 20:58:10

What do I want?

I want to match the Number_Visitors from table two against the Date and Time from table one. The hardest part: I need all visitors that fall inside a 10-minute range starting at each Date/Time from table one (all visitors between the start time and the start time + 10 minutes).

ID  Date         Time        End_Time   # I don't have this column yet
1   2017-11-13   06:34:50    06:44:50
2   2017-11-13   06:40:10    06:50:10
3   2017-11-14   23:58:10    00:08:10   # attention: the end time is one day later here

Result:

ID  Date         Time        End_Time  Number_of_Visitors_in_range
1   2017-11-13   06:34:50    06:44:50      28
2   2017-11-13   06:40:10    06:50:10      0
3   2017-11-14   23:58:10    00:08:10      15
R overflow
    Please add data using `dput` function, it will be easier to help you. Also check this question: [Data Table merge based on date ranges](https://stackoverflow.com/questions/21560500/data-table-merge-based-on-date-ranges) – pogibas Jul 12 '18 at 09:28
  • Added! First time using dput(), hopefully this is what you meant. – R overflow Jul 12 '18 at 10:59
  • What are the expected results in overlapping periods? Also in your results given the data shouldn't the number of visitors in first row be 38? – UpsideDownRide Jul 12 '18 at 11:48

1 Answer

There are multiple possible answers; "non-equi join" and "fuzzy join" are the search terms to look for.

Based on your examples (not the dputs), you could use something like the following. Explanation is in the code.

dplyr / fuzzyjoin:

library(dplyr)
library(lubridate)
library(fuzzyjoin)

# parse hit_time as POSIXct
df2$hit_time <- ymd_hms(df2$hit_time)

# first create two new columns so the start and end match hit_time in the other data.frame
df1 <- df1 %>% mutate(Start_Time = ymd_hms(paste(Date, Time)),
                      End_Time = Start_Time + minutes(10))

# use a fuzzy join to join everything together, then roll up
fuzzy_left_join(df1, df2,
                by = c(Start_Time = "hit_time", End_Time = "hit_time"),
                match_fun = list(`<=`, `>=`)) %>%
  group_by(ID, Start_Time, End_Time) %>%
  summarise(No_Visitors_in_range = sum(Number_Visitors))
# A tibble: 3 x 4
# Groups:   ID, Start_Time [?]
     ID Start_Time          End_Time            No_Visitors_in_range
  <int> <dttm>              <dttm>                             <int>
1     1 2017-11-13 06:34:50 2017-11-13 06:44:50                   38
2     2 2017-11-13 06:40:10 2017-11-13 06:50:10                   NA
3     3 2017-11-14 23:58:10 2017-11-15 00:08:10                   15
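
If you prefer the 0 from your expected output over the NA for windows without any hits, you could wrap the sum in coalesce(). A small, untested tweak on the code above:

# same join as above, but report 0 instead of NA for empty windows
fuzzy_left_join(df1, df2,
                by = c(Start_Time = "hit_time", End_Time = "hit_time"),
                match_fun = list(`<=`, `>=`)) %>%
  group_by(ID, Start_Time, End_Time) %>%
  summarise(No_Visitors_in_range = coalesce(sum(Number_Visitors), 0L))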

data.table:

library(data.table)
library(lubridate)

# parse hit_time as POSIXct
df2$hit_time <- ymd_hms(df2$hit_time)

df1 <- as.data.table(df1)
df2 <- as.data.table(df2)

# first create two new columns so the start and end match hit_time in the other data.table
df1[, Start_Time := ymd_hms(paste(Date, Time))][, End_Time := Start_Time + minutes(10)]

# add the per-window sum of Number_Visitors to table 1 via a non-equi join on table 2
df1[, No_Visitors_in_range := df2[df1, on = .(hit_time >= Start_Time, hit_time <= End_Time), sum(Number_Visitors), by = .EACHI]$V1]

df1
   ID       Date     Time          Start_Time            End_Time No_Visitors_in_range
1:  1 2017-11-13 06:34:50 2017-11-13 06:34:50 2017-11-13 06:44:50                   38
2:  2 2017-11-13 06:40:10 2017-11-13 06:40:10 2017-11-13 06:50:10                   NA
3:  3 2017-11-14 23:58:10 2017-11-14 23:58:10 2017-11-15 00:08:10                   15
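
Yet another data.table option for this kind of interval matching is foverlaps(). A rough sketch (untested against your real data); note that, unlike the non-equi left join above, it drops windows without any hits (your ID 2):

# give every hit a zero-width interval so points can be overlap-joined to windows
df2[, hit_end := hit_time]
# foverlaps needs the lookup table keyed on its interval columns
setkey(df1, Start_Time, End_Time)
foverlaps(df2, df1, by.x = c("hit_time", "hit_end"), nomatch = NULL)[
  , .(No_Visitors_in_range = sum(Number_Visitors)), by = .(ID, Start_Time, End_Time)]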

data:

df1 <- structure(list(ID = 1:3, Date = c("2017-11-13", "2017-11-13", 
"2017-11-14"), Time = c("06:34:50", "06:40:10", "23:58:10")), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(Number_Visitors = c(20L, 18L, 15L, 25L), hit_time = c("2017-11-13 06:34:50", "2017-11-13 06:34:50", "2017-11-15 00:06:10", "2018-12-14 20:58:10"
)), class = "data.frame", row.names = c(NA, -4L))

EDIT: because of the overlapping time frames, it is better to base End_Time on the lead of the next Start_Time, so the windows no longer overlap.

# End_Time = the next row's Start_Time, so the windows no longer overlap;
# the fill keeps the last window 10 minutes long
df1[, End_Time := shift(Start_Time, type = "lead", fill = last(Start_Time) + minutes(10))]

# add the per-window visitor sum to table 1 via a non-equi join
# (hit_time_gmt / visitor_id are the column names from your updated data)
df1[, No_Visitors_in_range := df2[df1, on = .(hit_time_gmt >= Start_Time, hit_time_gmt < End_Time), sum(visitor_id), by = .EACHI]$V1]

I got a warning here, maybe you will too, which is nothing to worry about and is explained here.
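
To find such overlaps in your own data before joining, a quick check along these lines should work (a sketch, using the original 10-minute windows):

# rows whose 10-minute window runs past the next row's start time, i.e. overlapping windows
df1[order(Start_Time)][Start_Time + minutes(10) > shift(Start_Time, type = "lead")]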

phiver
  • Thanks! The only thing that I am facing is that it is very slow. Do you know why? (Is it because of the fuzzy_left_join function?) – R overflow Jul 12 '18 at 13:36
  • Looks good! The only thing that I am facing now is that the No_Visitors_in_range is not always correct. I often have No_Visitors_in_range = 1, but if I check the data, it doesn't make sense. Could it be because my "Number_Visitors" column contains only 1's (so basically every row is a one; I thought to sum all the 1's in the same way)? – R overflow Jul 12 '18 at 14:46
  • @Roverflow, without seeing more data where the issue is, I have no idea what might cause this. Guesswork: It might be that your time interval of 10 mins has an overlap somewhere, so that number_visitor is counted twice. – phiver Jul 12 '18 at 16:13
  • you are totally right. Have updated the data. Hope you can check it one more time, anyway... really appreciated your help! – R overflow Jul 12 '18 at 16:37
  • @Roverflow, your example data doesn't overlap. I only get NA's out of the sum because df1 has dates from 2017-11-20 and df2 has dates from 2017-10-01. But you do indeed have overlapping timeframes in df1. Check for example rows 174 and 175. There lies your problem: visitors between 19:33 and 19:39 will be counted in both groups, and this leads to double counting. The solution is to base the end_time on the lead of start_time, but then the groups are not evenly distributed (aka not every 10 mins). If that is not an issue, I posted a solution in the answer to deal with this. – phiver Jul 12 '18 at 17:09