13

I am stuck with a project where I need to merge two data frames. They look something like this:

Data1
Traffic Source    Registrations    Hour    Minute
organic           1                6        13
social            1                8        54

Data2
Email                     Hour2   Minute2
test@domain.com           6         13
test2@domain2.com         8         55

I have the following line of code to merge the 2 data frames:

merge.df <- merge(Data1, Data2, by.x = c( "Hour", "Minute"),
           by.y = c( "Hour2", "Minute2"))

It would work great if the variable time (hours & minutes) wasn't slightly off between the two data sets. Is there a way to make the column "Minute" match with "Minute2" if it's + or - one minute off?

I thought I could create 2 new columns for data set one:

Data1
Traffic Source    Registrations   Hour   Minute    Minute_plus1   Minute_minus1
organic           1               6        13      14              12
social            1               8        54      55              53

Is it possible to merge the 2 data frames if "Minute2" matches any variable from either "Minute", "Minute_plus1", or "Minute_minus1"? Or is there a more efficient way to accomplish this merge?

Frank
  • 66,179
  • 8
  • 96
  • 180
heyydrien
  • 971
  • 1
  • 11
  • 28
  • 1
    Create a column `rounded.time = round(hour + minute/60, digits=n)` and try to adapt the rounding `n` as you wish to have +-1 minute precision. Then join by `rounded.time` – faidherbard Apr 28 '15 at 18:39
  • @faidherbard Could get some false positives if the times are close together, but I suppose no more or less than if minutes were accidentally close. – Brandon Bertelsen Apr 28 '15 at 18:42
  • I think the `plyr` and `data.table` packages could accomplish this fairly simply. (I don't know either very well so I'll defer that to someone else.) – Alex A. Apr 28 '15 at 18:42
  • @faidherbard I'm not so sure that will work. However you define the rounding, there will be some cutoff between two groups, and individual measurements in each group could be arbitrarily close. For example, if you round to the nearest 5 minutes, 12 and 13 will be separated even though they're only 1 minute apart. – Gregor Thomas Apr 28 '15 at 18:45
  • True! +1 to @Gregor's `dplyr` solution – faidherbard Apr 28 '15 at 18:52
  • Would it be possible to construct a datetime variable rather than having multiple separate integer time component variables? – Alex A. Apr 28 '15 at 18:59
  • 1
    Check `foverlaps` from `data.table` package. It is designed for performing overlapping interval joins efficiently. Joining the entire data first and then filtering is quite inefficient. [Here's a detailed post on using `foverlaps`](http://stackoverflow.com/a/25655497/559784) – Arun Apr 29 '15 at 15:43

1 Answers1

19

For stuff like this I usually turn to SQL:

library(sqldf)
x = sqldf("
  SELECT *
  FROM Data1 d1 JOIN Data2 d2
  ON d1.Hour = d2.Hour2
  AND ABS(d1.Minute - d2.Minute2) <= 1
")

Depending on the size of your data, you could also just join on Hour and then filter. Using dplyr:

library(dplyr)
x = Data1 %>%
  left_join(Data2, by = c("Hour" = "Hour2")) %>%
  filter(abs(Minute - Minute2) <= 1)

though you could do the same thing with base functions.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • 1
    Nice solution, but what about, for example, if `d1.Hour` and `d1.Minute` are 6 and 59 and `d2.Hour2` and `d2.Minute2` are 7 and 0? As times they're within a minute of each other but `abs(d1.Minute - d2.Minute2) > 1`. – Alex A. Apr 28 '15 at 18:49
  • 1
    @AlexA. I think thats the trade off to a manual merge. Perhaps build in a warning or test to tell the user that this has occured? – Brandon Bertelsen Apr 28 '15 at 18:52
  • @BrandonBertelsen: Not my question, but that's definitely something for the OP to consider. – Alex A. Apr 28 '15 at 18:54
  • The best solution for these in-comments cases would be to have these as some sort of unified datetime object, not as separate integer columns. – Gregor Thomas Apr 28 '15 at 18:57
  • @Gregor: Good point. I asked the OP whether that's possible in their situation. – Alex A. Apr 28 '15 at 19:00
  • But how do you join by datetime object? Tricky – faidherbard Apr 28 '15 at 19:00
  • 4
    Perhaps `... on abs(60 * d1.Hour + d1.Minute - 60 * d2.Hour - d2.Minute) <= 1` – G. Grothendieck Apr 28 '15 at 23:56