
I have two datasets that I would like to match based on datetime information. Datetime values in dataset 1 represent 10-minute time blocks that end at the given timestamp. I would like to match each row in dataset 2 with the 10-minute time block (and associated data) that it falls within from dataset 1. An example of the data is below:

Dataset 1
datetime            count
10/11/2012 16:25    231
10/11/2012 16:35    55
10/11/2012 16:45    0
10/11/2012 16:55    30
10/11/2012 17:05    22
10/11/2012 17:15    431
10/11/2012 17:25    455
10/11/2012 17:35    560
10/11/2012 17:45    7
10/11/2012 17:55    36
10/11/2012 18:05    12
10/11/2012 18:15    144
10/11/2012 18:25    583

Dataset 2
datetime            a       n   dpm
10/11/2012 16:26    2.03    27  3473
10/11/2012 17:24    1.35    28  3636
10/11/2012 18:21    7.63    29  3516

I would like to end up with something that looks like this:

datetime2           a       n   dpm     datetime1           count
10/11/2012 16:26    2.03    27  3473    10/11/2012 16:35    55
10/11/2012 17:24    1.35    28  3636    10/11/2012 17:25    455
10/11/2012 18:21    7.63    29  3516    10/11/2012 18:25    583

I am aware that there are functions such as cbind(), rbind() and merge() that can be used to merge datasets (based on a related question that I asked in the past: How to merge two data frames in r by a common column with mismatched date/time values), but I have been unable to write code that works in this case. I have tried various combinations of cbind(round(data, "mins")) and cbind(trunc(data, "hours")), but these seem to match multiple time blocks from dataset 1 to each row of dataset 2, rather than just the block that each row falls within. I have spent hours looking for a conditional rule that would solve this, but am having no luck. Any help would be greatly appreciated.


2 Answers


One option is to use a rolling (prevailing) join from data.table, in the spirit of last observation carried forward (locf):

library(data.table)
DT  <- data.table(dataset1, key='datetime')   # key on the POSIXct datetime column
DT1 <- data.table(dataset2, key='datetime')
DT[DT1, roll='nearest']                       # match each row of DT1 to the nearest key value in DT
             datetime count    a  n  dpm
1: 2012-11-10 16:26:00   231 2.03 27 3473
2: 2012-11-10 17:24:00   455 1.35 28 3636
3: 2012-11-10 18:21:00   583 7.63 29 3516
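Note that roll='nearest' matches each row to the closest block end in either direction, which is why the first row above picks up the 16:25 block (count 231) rather than the 16:35 block it falls within. If the enclosing block is what is wanted, a backward roll should do it; a minimal sketch using the same keyed tables:

DT[DT1, roll=-Inf]   # roll backwards (NOCB): match each time to the next block end at or after it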

EDIT

Here is how you can read the data. Generally I use the zoo package with read.zoo, but one trick is to add an extra column name in the header and then build the datetime column from it:

dat1 <- read.table(text='date time            a       n   dpm
10/11/2012 16:26    2.03    27  3473
10/11/2012 17:24    1.35    28  3636
10/11/2012 18:21    7.63    29  3516',header=TRUE)

dat1$datetime <- as.POSIXct(paste(dat1$date, dat1$time),
                            format='%d/%m/%Y %H:%M')
     a  n  dpm            datetime
1 2.03 27 3473 2012-11-10 16:26:00
2 1.35 28 3636 2012-11-10 17:24:00
3 7.63 29 3516 2012-11-10 18:21:00
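
The same trick would apply to dataset 1 before it is keyed; assuming it has been read the same way into a data frame called dat0 (a name used here only for illustration):

dat0$datetime <- as.POSIXct(paste(dat0$date, dat0$time),   # dat0 = dataset 1 read with a 'date time' header
                            format='%d/%m/%Y %H:%M')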
  • thanks for the comment. I've tried using the above code but I just get the following error message when using roll="nearest": `Error in `[.data.frame`(DT, DT1, roll = "nearest") : unused argument(s) (roll = "nearest")`. Any thoughts? – Emily Jun 26 '13 at 15:40
  • @Emily you should use `data.table` and not `data.frame`. The error message you show says `[.data.frame`(DT, DT1, roll = "nearest")`. – agstudy Jun 26 '13 at 15:48
  • apologies but I am relatively new to R! I thought that the first two lines of your code converted my data.frames (with POSIXt class datetime columns) to data.tables. I have now done this manually using dat1 <- data.table(dat.1), and the data.table function will not accept key='datetime' because a type 'list' is not allowed as a key (I assume it does not like POSIXct or POSIXlt objects). When datetime is left as a factor I get a similar error message saying that it does not like factors! – Emily Jun 26 '13 at 16:41
  • @Emily just convert your `POSIXlt` columns to `POSIXct` with `as.POSIXct`. Do this before turning your data frames into data tables. – Hong Ooi Jun 26 '13 at 16:44
  • POSIXlt is a list. You should use POSIXct, which is numeric (use `mode` to check this). Besides @HongOoi's comment, I show in my edit how you can coerce the datetime to POSIXct. – agstudy Jun 26 '13 at 16:45
  • @agstudy: thanks for clarifying the above, and for bearing with me. The code now works! However, the output data.table no longer includes the datetime column from dataset 1, which I would like. Is there a way to do this? Many thanks. – Emily Jun 27 '13 at 10:03
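
Following up on the last comment, one way that should keep dataset 1's own datetime in the result is to copy the key into an ordinary column before joining; a minimal sketch, assuming the keyed tables from the answer above:

DT[, datetime1 := datetime]   # copy the key; non-key columns of DT survive the join
DT[DT1, roll='nearest']       # result now carries datetime1 (the matched block end) alongside dataset 2's datetime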

Something like this?

# index of the first row in df1 whose block end is after dt
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)   # one matching df1 row per df2 row
res <- cbind(df2, df1[rows,])

           datetime    a  n  dpm         datetime count
2  10/11/2012 16:26 2.03 27 3473 10/11/2012 16:35    55
7  10/11/2012 17:24 1.35 28 3636 10/11/2012 17:25   455
13 10/11/2012 18:21 7.63 29 3516 10/11/2012 18:25   583
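
If a timestamp can fall exactly on a block end (which, by the definition in the question, belongs to the block ending at that time), >= may be the safer comparison; a small variation on the same idea:

findRow <- function(dt, df) { min(which(df$datetime >= dt )) }   # keep rows sitting exactly on a block end in that block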

PS1: I think the count of your expected result is wrong on row #1

PS2: It would have been easier if you had provided the datasets in a directly usable form. I had to do:

d1 <- 
'datetime            count
10/11/2012 16:25    231
...
'
d2 <- 
'datetime            a       n   dpm
10/11/2012 16:26    2.03    27  3473
10/11/2012 17:24    1.35    28  3636
10/11/2012 18:21    7.63    29  3516
'

.parse <- function(s) {
    # collapse runs of 2+ spaces to tabs so the single space inside each datetime survives
    cs <- gsub('\\s\\s+', '\t', s)
    read.table(text=cs, sep="\t", header=TRUE, stringsAsFactors=FALSE)
}

df1 <- .parse(d1)
df2 <- .parse(d2)
  • I edited my answer to show one way you can read the data without regular expressions. +1. – agstudy Jun 26 '13 at 15:56
  • @Karl: Thanks for the solution! It works well and gives me an output data frame that includes both sets of datetimes. I will try to provide data in a more usable form in the future. – Emily Jun 27 '13 at 10:45