
I wish to merge two datasets using the 'datetime' columns present in both (POSIXct format: dd/mm/yyyy hh:mm). Below is example data from the two datasets:

# Dataset 1 (dts1)

           datetime   count   period   
1  30/03/2011 02:32      27      561
2  30/03/2011 02:42       3      600
3  30/03/2011 02:52       0      574
4  30/03/2011 03:02       1      550
5  30/03/2011 03:12      15      600
6  30/03/2011 03:22       0      597

# Dataset 2 (dts2)

   datetime         dist car   satd      alt
1 30/03/2011 01:59  23.9   1      3     1.76       
2 30/03/2011 02:58  14.7   1      7     6.36       
3 30/03/2011 03:55  10.4   2      9    -0.34      
4 30/03/2011 04:53  35.4   1      3     3.55      
5 30/03/2011 05:52  56.1   1      7    -0.91       
6 30/03/2011 06:48  12.3   1      4     6.58      
7 30/03/2011 07:48  10.7   1      5     4.18      

If it were a simple case of merging matching rows from the two data frames, then the basic merge(dts1, dts2, by="datetime") or rbind() functions could be used.

However, my problem is more complicated as the time intervals in the two datasets are not equal. Dataset 1 contains data at precise 10-minute intervals (each row incorporates information on the 10-minute block that ends at the specified date/time), while dataset 2 contains data at approximate 1-hour intervals (each row incorporates information from the 1-hour block that ends at the specified date/time).

To make things harder still, there is a time mismatch between the start times of the rows in the two datasets (e.g. dts1: 01/03/2013 10:00:00, dts2: 01/03/2012 09:58:12). The dts2 intervals also vary throughout the dataset (± a few minutes either side of 1 hour). I want to link every 10-minute data row in dataset 1 with the hour block it falls within in dataset 2 (along with all the associated column values from dts2). There will be some rows that could fit into two different hour blocks (e.g. 30/03/2011 03:02), but I only need these rows to be linked to one of the hour blocks.

I would like to end up with something like this:

        datetime_dts1 count period     datetime2_dts2  dist  car satd      alt  
1    30/03/2011 02:32    27    561   30/03/2011 02:58  14.7   1     7     6.36      
2    30/03/2011 02:42     3    600   30/03/2011 02:58  14.7   1     7     6.36
3    30/03/2011 02:52     0    574   30/03/2011 02:58  14.7   1     7     6.36
4    30/03/2011 03:02     1    550   30/03/2011 02:58  14.7   1     7     6.36
5    30/03/2011 03:12    15    600   30/03/2011 03:55  10.4   2     9    -0.34
6    30/03/2011 03:22     0    597   30/03/2011 03:55  10.4   2     9    -0.34

I have searched for an answer to this problem but have not been able to solve it, and my R abilities are not advanced. If anyone could point me in the right direction or provide a solution, I would be extremely grateful.

  • Could you provide a `dput` output? It is a pain to get the data. – Arun Mar 04 '13 at 11:53
  • I couldn't really follow your long text, but couldn't you just round the times and merge based on the rounded times? – Roland Mar 04 '13 at 11:54
  • 5th row, `3:12` is closer to `2:58` than to `3:55`. On what basis do you assign it to `3:55`? If there are no rules, it is difficult to merge. – Arun Mar 04 '13 at 13:19
  • Hi Arun/Roland. Thanks for your comments. Arun: I assign 3:12 to 3:55 as I wish for all dts1 values to be matched with the nearest hh:mm row in dts2 greater than itself, unless the value in dts1 is less than 5 minutes more than the hh:mm row in dts2 preceding it (example below). Sorry, I am not sure how to provide a dput output. – Emily Mar 10 '13 at 19:02

2 Answers


After first converting your datetime character strings to POSIXt classes, some combination of rounding and truncating those times should get you something you can use as the basis of a merge.

First read in your data, and create corresponding POSIXt datetimes:

dts1 <- structure(list(datetime = structure(1:6,
   .Label = c("30/03/2011 02:32", "30/03/2011 02:42",
   "30/03/2011 02:52", "30/03/2011 03:02", "30/03/2011 03:12",
   "30/03/2011 03:22"), class = "factor"), count = c(27L, 3L,
   0L, 1L, 15L, 0L), period = c(561L, 600L, 574L, 550L, 600L,
   597L)), .Names = c("datetime", "count", "period"),
   class = "data.frame", row.names = c(NA, -6L))
dts2 <- structure(list(datetime = structure(1:7,
    .Label = c("30/03/2011 01:59", "30/03/2011 02:58",
    "30/03/2011 03:55", "30/03/2011 04:53", "30/03/2011 05:52",
    "30/03/2011 06:48", "30/03/2011 07:48"), class = "factor"),
    dist = c(23.9, 14.7, 10.4, 35.4, 56.1, 12.3, 10.7), car =
    c(1L, 1L, 2L, 1L, 1L, 1L, 1L), satd = c(3L, 7L, 9L, 3L, 7L,
    4L, 5L), alt = c(1.76, 6.36, -0.34, 3.55, -0.91, 6.58,
    4.18)), .Names = c("datetime", "dist", "car", "satd",
    "alt"), class = "data.frame", row.names = c(NA, -7L))

# create corresponding POSIXlt vector
# (you could update the 'datetime' columns in-place if you prefer)
datetime1 <- strptime(dts1$datetime, format="%d/%m/%Y %H:%M")
datetime2 <- strptime(dts2$datetime, format="%d/%m/%Y %H:%M")
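
For a quick sense of what the rounding and truncation used below actually do, you can apply them directly to these vectors (just an illustration, not part of the merge itself):

# round() goes to the nearest whole hour, trunc() drops back to the start of the hour
round(datetime1, "hours")   # e.g. 02:32 becomes 03:00
trunc(datetime1, "hours")   # e.g. 02:32 becomes 02:00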

The following code produces a merged table based on the nearest hour in all cases. Inside the merge it's just prepending a column with the rounded times to each of your data frames, merging based on that (i.e., column number 1), then using the -1 index to remove that column at the end:

# merge based on nearest hour
merge(
    cbind(round(datetime1, "hours"), dts1),
    cbind(round(datetime2, "hours"), dts2),
    by=1, suffixes=c("_dts1", "_dts2")
)[-1]

     datetime_dts1 count period    datetime_dts2 dist car satd  alt
1 30/03/2011 02:32    27    561 30/03/2011 02:58 14.7   1    7 6.36
2 30/03/2011 02:42     3    600 30/03/2011 02:58 14.7   1    7 6.36
3 30/03/2011 02:52     0    574 30/03/2011 02:58 14.7   1    7 6.36
4 30/03/2011 03:02     1    550 30/03/2011 02:58 14.7   1    7 6.36
5 30/03/2011 03:12    15    600 30/03/2011 02:58 14.7   1    7 6.36
6 30/03/2011 03:22     0    597 30/03/2011 02:58 14.7   1    7 6.36

As above, but this time just truncating to the hour:

merge(
    cbind(trunc(datetime1, "hours"), dts1),
    cbind(trunc(datetime2, "hours"), dts2),
    by=1, suffixes=c("_dts1", "_dts2")
)[-1]

     datetime_dts1 count period    datetime_dts2 dist car satd   alt
1 30/03/2011 02:32    27    561 30/03/2011 02:58 14.7   1    7  6.36
2 30/03/2011 02:42     3    600 30/03/2011 02:58 14.7   1    7  6.36
3 30/03/2011 02:52     0    574 30/03/2011 02:58 14.7   1    7  6.36
4 30/03/2011 03:02     1    550 30/03/2011 03:55 10.4   2    9 -0.34
5 30/03/2011 03:12    15    600 30/03/2011 03:55 10.4   2    9 -0.34
6 30/03/2011 03:22     0    597 30/03/2011 03:55 10.4   2    9 -0.34

As above, but for dts1, treat each record as belonging to the previous hour until 10 minutes past the hour, by subtracting 10*60 seconds before truncating. This one produces the same output you specified, though without more information I'm not sure it's the exact rule you want.

merge(
    cbind(trunc(datetime1 - 10*60, "hours"), dts1),
    cbind(trunc(datetime2, "hours"), dts2),
    by=1, suffixes=c("_dts1", "_dts2")
)[-1]

     datetime_dts1 count period    datetime_dts2 dist car satd   alt
1 30/03/2011 02:32    27    561 30/03/2011 02:58 14.7   1    7  6.36
2 30/03/2011 02:42     3    600 30/03/2011 02:58 14.7   1    7  6.36
3 30/03/2011 02:52     0    574 30/03/2011 02:58 14.7   1    7  6.36
4 30/03/2011 03:02     1    550 30/03/2011 02:58 14.7   1    7  6.36
5 30/03/2011 03:12    15    600 30/03/2011 03:55 10.4   2    9 -0.34
6 30/03/2011 03:22     0    597 30/03/2011 03:55 10.4   2    9 -0.34

You could tweak the details of which ones you round, which ones you truncate, and whether you first subtract/add some time depending on your specific rule.

Edit:

Not the most elegant, but here is a different approach that accommodates the more complicated conditional rule you described in your comments. This leans heavily on na.locf from the zoo package to first determine which dts2 times come before and after each dts1 record. With those in hand, it's just a matter of applying the rule to select the desired dts2 time, matching back to the original dts1 table, then merging.
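
If na.locf is unfamiliar: it fills each NA with the last preceding non-NA value, or with the next following non-NA value when fromLast = TRUE. A tiny illustration on made-up numbers:

library(zoo)
na.locf(c(NA, 1, NA, NA, 4, NA), na.rm = FALSE)                   # NA 1 1 1 4 4
na.locf(c(NA, 1, NA, NA, 4, NA), na.rm = FALSE, fromLast = TRUE)  # 1 1 4 4 4 NA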

library(zoo)

# create ordered list of all datetimes, using names to keep
# track of which ones come from each data frame
alldts <- sort(c(
    setNames(datetime1, rep("dts1", length(datetime1))),
    setNames(datetime2, rep("dts2", length(datetime2)))))
is.dts1 <- names(alldts)=="dts1"

# for each dts1 record, get previous closest dts2 time
dts2.prev <- alldts
dts2.prev[is.dts1] <- NA
dts2.prev <- na.locf(dts2.prev, na.rm=FALSE)[is.dts1]

# for each dts1 record, get next closest dts2 time
dts2.next <- alldts
dts2.next[is.dts1] <- NA
dts2.next <- na.locf(dts2.next, na.rm=FALSE, fromLast=TRUE)[is.dts1]

# for each dts1 record, apply rule to choose dts2 time
# (difftime with explicit units avoids ambiguity in the "< 5 minutes" test)
use.prev <- !is.na(dts2.prev) &
    (difftime(alldts[is.dts1], dts2.prev, units = "mins") < 5)
dts2.to.use <- ifelse(use.prev, as.character(dts2.prev), 
    as.character(dts2.next))

# merge based on chosen dts2 times, prepended as character vector
# for the purpose of merging
merge(
    cbind(.dt=dts2.to.use[match(datetime1, alldts[is.dts1])], dts1),
    cbind(.dt=as.character(datetime2), dts2),
    by=".dt", all.x=TRUE, suffixes=c("_dts1", "_dts2")
)[-1]
regetz
  • Thank you for your suggestions regetz. Very helpful. The problem with subtracting 10*60 before truncating is that it only considers the hour, not the minutes. My hour values in dts2 are not always 60 minutes long. Instead, I want the script to look at each row in dts1 and match it to the nearest hour:min row in dts2 that is greater than itself, unless the value in dts1 is less than 5 minutes more than the hour:min row in dts2 preceding it. – Emily Mar 10 '13 at 18:32
  • An example: `30/03/2011 03:32` matched to `30/03/2011 03:49`, `30/03/2011 03:42` matched to `30/03/2011 03:49`, **`30/03/2011 03:52` matched to `30/03/2011 03:49`**, `30/03/2011 04:02` matched to `30/03/2011 04:42`, and so on... – Emily Mar 10 '13 at 18:52
  • Excellent, this is exactly what I was looking for. Many thanks for your extensive answer. Much appreciated! – Emily Mar 12 '13 at 22:24

The rolling-join selector of data.table seems well suited for this. It is very efficient and allows merging to the nearest value (upper, lower, or both).

Examples of left joins, right joins, etc. can be found on this website: https://www.r-bloggers.com/understanding-data-table-rolling-joins/

library(data.table)

# (from the linked blog post: 'website' and 'paypal' are data.tables of
# session and purchase events, joined on user name and time)
website[, join_time := session_start_time]   # copy the time column used as the join key
paypal[, join_time := purchase_time]
setkey(website, name, join_time)
setkey(paypal, name, join_time)
website[paypal, roll = TRUE]   # for each purchase, roll to the most recent session

About DT: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
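
Applied to this question's data, a rolling join might look roughly like the sketch below (the dt1/dt2 and dts2_time names are mine, not from the linked post). roll = -Inf rolls each dts1 row forward to the next dts2 time, while roll = "nearest" would instead pick whichever dts2 time is closest:

library(data.table)

dt1 <- as.data.table(dts1)
dt2 <- as.data.table(dts2)

# parse the datetime strings so they can be used as a join key
dt1[, datetime := as.POSIXct(as.character(datetime), format = "%d/%m/%Y %H:%M", tz = "UTC")]
dt2[, datetime := as.POSIXct(as.character(datetime), format = "%d/%m/%Y %H:%M", tz = "UTC")]

# keep a copy of dts2's own time, because the join column in the result
# takes its values from dts1
dt2[, dts2_time := datetime]

setkey(dt1, datetime)
setkey(dt2, datetime)

# one row per dts1 record, matched to the next dts2 record at or after it
dt2[dt1, roll = -Inf]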

pepece