The problem: I have two dataframes that I would like to merge depending on the date/time of one dataframe being in the interval of the other dataframe.

traffic: Date and Time (POSIXct), Frequency

mydata: Interval, Sum of Frequency

I would now like to check whether the POSIXct time from traffic falls within an interval of mydata and, if it does, add its frequency to the "Sum of Frequency" column in mydata.

I ran into two problems:

  1. The traffic data frame has significantly more rows than mydata. I don't know how to tell R to loop through every observation in traffic and check it against each row of mydata.

  2. More than one observation can fall within an interval of mydata. I want R to add up the frequencies of all matching traffic observations to get a total. The intervals also overlap.

Here is the data:

DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00",  "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)                 
traffic <- data.frame(DateTime, Frequency)

library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00",  "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00",  "2015-08-01 14:15:00")
DateTime1 <- as.POSIXct(DateTime1)
DateTime2 <- as.POSIXct(DateTime2)
mydata <- data.frame(DateTime1, DateTime2)
mydata$Interval <- as.interval(DateTime1, DateTime2)
mydata$SumFrequency <- NA

The expected outcome should be something like this:

mydata$SumFrequency <- c(24, 2, 2)
head(mydata)

I tried int_overlaps from the lubridate package. Any tips on how to solve this are highly appreciated!

Anna2803

5 Answers

A data.table approach that uses %between% to filter the traffic dataset by time:

library(data.table)

setDT(traffic)
setDT(mydata)

mydata[,SumFrequency := as.numeric(SumFrequency)] # coerce logical to numeric for next step.
mydata[,SumFrequency := sum( traffic[ DateTime %between% c(DateTime1, DateTime2), Frequency] ), by=1:nrow(mydata)] 

which gives:

             DateTime1           DateTime2                                           Interval SumFrequency
1: 2014-11-01 04:00:00 2014-11-01 04:15:00   2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET           24
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST            2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST            2

If mydata has a lot of rows, it may be better to create an index column and use it in the by clause:

mydata[, idx := .I]
mydata[, SumFrequency := sum( traffic[DateTime %between% c(DateTime1, DateTime2),Frequency] ),by=idx]

And this gives:

             DateTime1           DateTime2                                           Interval SumFrequency idx
1: 2014-11-01 04:00:00 2014-11-01 04:15:00   2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET           24   1
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST            2   2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST            2   3
Tensibai

A short solution with foverlaps from the data.table package:

library(data.table)

mydata <- data.table(DateTime1, DateTime2, key = c("DateTime1", "DateTime2"))
traffic <- data.table(start = DateTime, end = DateTime, Frequency, key = c("start","end"))

foverlaps(traffic, mydata, type="within", nomatch=0L)[, .(sumFreq = sum(Frequency)), 
                                                      by = .(DateTime1, DateTime2)]

which gives:

             DateTime1           DateTime2 sumFreq
1: 2014-11-01 04:00:00 2014-11-01 04:15:00      24
2: 2015-08-01 04:03:00 2015-08-01 04:13:00       2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00       2
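
For comparison, a non-equi join can express the same range condition directly in `on`. This is only a sketch and not part of the original answer: it assumes data.table 1.9.8 or later (non-equi joins are not available in 1.9.6), rebuilds the question's sample data, and fixes the time zone to UTC purely so the result does not depend on the local setting.

```r
library(data.table)

# Sample data from the question; tz = "UTC" is an assumption made for reproducibility.
traffic <- data.table(
  DateTime = as.POSIXct(c("2014-11-01 04:00:00", "2014-11-01 04:03:00",
                          "2014-11-01 04:06:00", "2014-11-01 04:08:00",
                          "2014-11-01 04:10:00", "2014-11-01 04:12:00",
                          "2015-08-01 04:13:00", "2015-08-01 04:45:00",
                          "2015-08-01 14:15:00", "2015-08-01 14:13:00"),
                        tz = "UTC"),
  Frequency = c(1, 2, 3, 5, 12, 1, 2, 2, 1, 1)
)
mydata <- data.table(
  DateTime1 = as.POSIXct(c("2014-11-01 04:00:00", "2015-08-01 04:03:00",
                           "2015-08-01 14:00:00"), tz = "UTC"),
  DateTime2 = as.POSIXct(c("2014-11-01 04:15:00", "2015-08-01 04:13:00",
                           "2015-08-01 14:15:00"), tz = "UTC")
)

# For each row of mydata, sum the Frequency of all traffic rows whose
# DateTime lies in [DateTime1, DateTime2] (both ends inclusive).
res <- traffic[mydata,
               .(SumFrequency = sum(Frequency)),
               on = .(DateTime >= DateTime1, DateTime <= DateTime2),
               by = .EACHI]
res$SumFrequency
```

Note that an interval with no matching traffic rows yields NA here rather than 0, so a follow-up replacement may be needed depending on what the total should mean.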
Jaap

I see two solutions:

With data.frame and plyr

You could do it using the %within% function from lubridate, either with a for loop or with one of plyr's loop functions such as dlply:

DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00",  "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)                 
traffic <- data.frame(DateTime, Frequency)

library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00",  "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00",  "2015-08-01 14:15:00")
DateTime1 <- as.POSIXct(DateTime1)
DateTime2 <- as.POSIXct(DateTime2)
mydata <- data.frame(DateTime1, DateTime2)
mydata$Interval <- as.interval(DateTime1, DateTime2)

library(plyr)
# Create a group-by variable
mydata$NumInt <- 1:nrow(mydata)
mydata$SumFrequency <- dlply(mydata, .(NumInt),
                             function(row){
                               sum(
                                 traffic[traffic$DateTime %within% row$Interval, "Frequency"]
                               )
                             })

mydata
#>             DateTime1           DateTime2
#> 1 2014-11-01 04:00:00 2014-11-01 04:15:00
#> 2 2015-08-01 04:03:00 2015-08-01 04:13:00
#> 3 2015-08-01 14:00:00 2015-08-01 14:15:00
#>                                             Interval NumInt SumFrequency
#> 1   2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET      1           24
#> 2 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST      2            2
#> 3 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST      3            2

With data.table and functions foverlaps

data.table implements a function for overlapping joins, foverlaps, that you can use here with a little trick (I use data.table 1.9.6 below).
(See "How to perform join over date ranges using data.table?" and this presentation.)

Note that you do not need to create an interval with lubridate.

library(data.table)

DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00",  "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)                 
traffic <- data.table(DateTime, Frequency)

DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00",  "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00",  "2015-08-01 14:15:00")
mydata <- data.table(DateTime1 = as.POSIXct(DateTime1), DateTime2 = as.POSIXct(DateTime2))

# Use function `foverlaps`  for overlapping joins

# Here's the trick : create a dummy variable to artificially have an interval
traffic[, dummy:=DateTime]
setkey(mydata, DateTime1, DateTime2)

# do the join
mydata2 <- foverlaps(traffic, mydata, by.x=c("DateTime", "dummy"), type ="within", nomatch=0L)[, dummy := NULL][]
mydata2
#>              DateTime1           DateTime2            DateTime Frequency
#> 1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00         1
#> 2: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:03:00         2
#> 3: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:06:00         3
#> 4: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:08:00         5
#> 5: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:10:00        12
#> 6: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:12:00         1
#> 7: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:13:00         2
#> 8: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:15:00         1
#> 9: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:13:00         1

# summarise with a sum by grouping by each line of mydata
setkeyv(mydata2, key(mydata))
mydata2[mydata, .(SumFrequency = sum(Frequency)), by = .EACHI]
#>              DateTime1           DateTime2 SumFrequency
#> 1: 2014-11-01 04:00:00 2014-11-01 04:15:00           24
#> 2: 2015-08-01 04:03:00 2015-08-01 04:13:00            2
#> 3: 2015-08-01 14:00:00 2015-08-01 14:15:00            2
cderv
  • your `foverlaps` solution is pretty long-winded, see my answer on how to make it a lot shorter – Jaap Mar 04 '16 at 11:05
  • no need to update your answer, that would make them identical which isn't the purpose imo – Jaap Mar 04 '16 at 14:01
As far as point 2 is concerned, you can use aggregate, for instance:

aggData <- aggregate(traffic$Frequency~format(traffic$DateTime, "%Y%m%d %H:%M"), data=traffic, sum)

This sums all frequencies in minute intervals.
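
For reference, a minute-level grouping key can be built with `format()` using strptime-style codes, where `%H` is the hour and `%M` the minute. The sketch below uses made-up timestamps (not the question's data) and assumes `tz = "UTC"` so the result does not depend on the local time zone.

```r
# Hypothetical timestamps; the first two fall in the same minute.
x <- as.POSIXct(c("2014-11-01 04:03:10",
                  "2014-11-01 04:03:50",
                  "2014-11-01 04:04:05"), tz = "UTC")
freq <- c(1, 2, 3)

# Truncate each timestamp to the minute by formatting it as text.
minute_key <- format(x, "%Y%m%d %H:%M")

# Sum the frequencies that share a minute key.
per_minute <- tapply(freq, minute_key, sum)
per_minute
```

Grouping by a formatted string only collapses observations that share the same clock minute; it does not test membership in arbitrary start/end intervals.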

As for point 1: wouldn't a merge work?

merge(x = mydata, y = aggData, by = "DateTime", all.x = TRUE)

The outer merge is explained here

Etienne Moerman
  • There is no need for explicit calls to variables if you are using the `data` argument in `aggregate`. – Roman Luštrik Mar 04 '16 at 09:01
  • Unfortunately it doesn't work. The traffic data is already aggregated to the level of aggregation that I need. I want to aggregate the frequencies of different rows once the calculation determines that several of the data points are within an interval. When using merge it adds the Frequency column, but only returns NAs. – Anna2803 Mar 04 '16 at 09:50
Using a for loop, we could do something like this:

for(i in 1:nrow(mydata)) {
  mydata$SumFrequency[i] <- sum(traffic$Frequency[traffic$DateTime %within% mydata$Interval[i]])
}

> mydata
#            DateTime1           DateTime2                                           Interval SumFrequency
#1 2014-11-01 04:00:00 2014-11-01 04:15:00   2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET           24
#2 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST            2
#3 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST            2
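
The same idea can be sketched in base R without lubridate by comparing against the start and end columns directly. This is an illustrative variant, not part of the original answer: it rebuilds the question's sample data, treats both interval ends as inclusive, and assumes `tz = "UTC"` for reproducibility.

```r
# Sample data from the question; tz = "UTC" is an assumption.
traffic <- data.frame(
  DateTime = as.POSIXct(c("2014-11-01 04:00:00", "2014-11-01 04:03:00",
                          "2014-11-01 04:06:00", "2014-11-01 04:08:00",
                          "2014-11-01 04:10:00", "2014-11-01 04:12:00",
                          "2015-08-01 04:13:00", "2015-08-01 04:45:00",
                          "2015-08-01 14:15:00", "2015-08-01 14:13:00"),
                        tz = "UTC"),
  Frequency = c(1, 2, 3, 5, 12, 1, 2, 2, 1, 1)
)
mydata <- data.frame(
  DateTime1 = as.POSIXct(c("2014-11-01 04:00:00", "2015-08-01 04:03:00",
                           "2015-08-01 14:00:00"), tz = "UTC"),
  DateTime2 = as.POSIXct(c("2014-11-01 04:15:00", "2015-08-01 04:13:00",
                           "2015-08-01 14:15:00"), tz = "UTC")
)

# For each interval row, flag the traffic rows inside [DateTime1, DateTime2]
# and sum their frequencies; vapply guarantees one numeric value per row.
mydata$SumFrequency <- vapply(seq_len(nrow(mydata)), function(i) {
  inside <- traffic$DateTime >= mydata$DateTime1[i] &
    traffic$DateTime <= mydata$DateTime2[i]
  sum(traffic$Frequency[inside])
}, numeric(1))

mydata$SumFrequency
```

Unlike `1:nrow(mydata)`, `seq_len(nrow(mydata))` yields an empty sequence when mydata has no rows, so the loop body is simply skipped in that case.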
mtoto