
I have two data frames, df.events and df.activ.

df.activ has very granular minute-level data and an order of magnitude more records (1,000,000+) than df.events, which has ~100,000 records, also at minute-level granularity. The two data frames share two fields, DateTime and Geo. Both DateTime columns are POSIXlt, in %Y-%m-%d %H:%M:%S format.

df.activ <- read.table(text=
                          '"DateTime","Geo","Bin1","Bin2"
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,510,0,1
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:13:00,618,1,1
                        2014-07-01 00:13:00,510,0,1
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0',header=TRUE,sep=",")

df.events <- read.table(text=
                          '"Units","Geo","DateTime"
                        225,999,2014-07-01 00:09:00
                        40,510,2014-07-01 00:12:00
                        5,999,2014-07-01 00:28:00
                        115,999,2014-07-01 00:44:00
                        0,999,2014-07-01 00:47:00',header=TRUE,sep=",")
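
(Note: read.table above leaves DateTime as character/factor, so a conversion step is needed to get the date-time values described; a minimal sketch, using as.POSIXct, though as.POSIXlt works the same way:)

df.activ$DateTime <- as.POSIXct(df.activ$DateTime)    # parses "%Y-%m-%d %H:%M:%S"
df.events$DateTime <- as.POSIXct(df.events$DateTime)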

My goal is to merge each df.activ row with the row having the nearest DateTime in df.events whenever the Geo value in that df.events row is 999.

If the df.events Geo is not 999, then I only want to merge that df.events row when the Geo fields match (e.g., the rows where Geo = 510 in the provided data frames).

I know for-loops aren't the preferred way to solve things in R, but conceptually I'm looking for a nested for-loop: loop down df.activ's DateTime field and bring in the record with the closest DateTime from df.events whenever that record's Geo is 999 or matches the Geo field in df.activ.

The data frame below is what I'm after:

df.idealresults <- read.table(text=
                              'DateTime,Geo,Bin1,Bin2,events.DateTime,events.Units,Events.Geo
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,510,0,1,7/1/2014 0:12,40,510
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,618,1,1,7/1/2014 0:09,225,999
                              7/1/2014 0:13,510,0,1,7/1/2014 0:12,40,510
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999',header=TRUE,sep=',')

Thus far, I've been able to merge df.activ to the nearest DateTime in df.events. I accomplished this with an na.locf-based approach inspired by the latter part of the answer to this SO post. However, I've struggled to incorporate the Geo-matching logic into this approach: the nature of na.locf makes it difficult to get right, since it relies on vector-to-vector NAs that are bound on before the merge step.
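
For reference, a minimal sketch of that nearest-DateTime merge, written here with findInterval rather than na.locf (the same nearest-neighbor idea) and without the Geo logic; it assumes the DateTime columns are already POSIXct:

ev <- df.events[order(df.events$DateTime), ]   # events sorted by time
a  <- as.numeric(df.activ$DateTime)            # seconds since epoch
e  <- as.numeric(ev$DateTime)

prev <- pmax(findInterval(a, e), 1L)           # last event at or before each row
nxt  <- pmin(prev + 1L, length(e))             # first event after each row
nearest <- ifelse(abs(a - e[nxt]) < abs(a - e[prev]), nxt, prev)

names(ev) <- paste0("events.", names(ev))      # avoid name clashes
merged <- cbind(df.activ, ev[nearest, ])

Ties between equally distant events go to the earlier event.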

– leaRningR909

2 Answers


It is sometimes hard to avoid loops, especially when you have conditions like yours. We often spend a lot of effort avoiding them when they are either the best we can do, or not far behind in terms of performance and/or readability. Having said that, this will do the trick:

df.activ$DateTime <- as.POSIXct(df.activ$DateTime)
df.events$DateTime <- as.POSIXct(df.events$DateTime)

results <- df.activ
results$events.Units <- NA
results$events.Geo <- NA
results$events.Datetime <- NA

for (i in seq_len(nrow(df.activ))) {
  # event row indices, ordered from nearest in time to farthest
  diffs <- order(abs(df.activ$DateTime[i] - df.events$DateTime))
  for (j in seq_along(diffs)) {
    if (df.events$Geo[diffs[j]] == 999) {
      results[i, 5:7] <- df.events[diffs[j], ]
      break
    } else if (isTRUE(df.events$Geo[diffs[j]] == df.activ$Geo[i])) {
      # isTRUE() guards against the NA Geo values in df.activ
      results[i, 5:7] <- df.events[diffs[j], ]
      break
    }
  }
}

# the assignment above coerced the event DateTime to numeric seconds; convert back
results$events.DateTime <- as.POSIXct(results$events.Datetime, origin = "1970-01-01")

results
              DateTime Geo Bin1 Bin2 events.Units events.Geo events.Datetime     events.DateTime
1  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
2  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
3  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
4  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
5  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
6  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
7  2014-07-01 00:12:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
8  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
9  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
10 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
11 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
12 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
13 2014-07-01 00:13:00 618    1    1          225        999      1404187740 2014-07-01 00:09:00
14 2014-07-01 00:13:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
15 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
16 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
17 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
18 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
19 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
20 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
21 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
– Dominic Comtois
  • Thank you Dominic. This works great. Just a quick follow-up question: is it R best practice to avoid loops in this situation if I'm going to scale this code to data frames with millions of records? I'm going to try to modify the function + sapply approach from this post (http://stackoverflow.com/questions/17321886/matching-data-from-one-dataframe-to-a-time-block-that-it-fits-within-in-a-second?lq=1) and do a head-to-head with system.time against the loop approach. – leaRningR909 Mar 16 '15 at 05:51
  • You're very welcome! As for your question, it's hard to give a definitive answer; a lot has been said, and is still being said, on this topic. I'm not used to working with huge databases, so generally I'm fine with loops. I don't think a million rows would be a problem unless you have a low-memory machine (and even then, the issue is more about speed than memory). But I'd be curious to know how things work out for you; feel free to update us with your results if you come up with alternate solutions. – Dominic Comtois Mar 16 '15 at 05:56

I am at work and this seems mostly solved, so I will be brief. You can also do a full Cartesian merge, then take the differences between the dates and keep one distinct row per activity record, sorted by the absolute value of the date difference.

This is probably the algorithmically fastest way to do the merge, but it will require more RAM than the for-loop approach (the full merge has n1 * n2 rows).
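
A rough base-R sketch of that idea (assuming the DateTime columns are already POSIXct, a temporary row.id key added to df.activ, and base R's duplicated in place of dplyr's distinct; ties between equally near events break arbitrarily):

df.activ$row.id <- seq_len(nrow(df.activ))     # key to collapse back on later

ev <- df.events
names(ev) <- paste0("events.", names(ev))      # avoid name clashes in the merge

full <- merge(df.activ, ev, by = NULL)         # by = NULL => Cartesian product
keep <- full$events.Geo == 999 |
  (!is.na(full$Geo) & full$Geo == full$events.Geo)
full <- full[keep, ]

# sort by absolute time difference, then keep the first (nearest) event per row
full <- full[order(abs(as.numeric(full$DateTime) -
                         as.numeric(full$events.DateTime))), ]
result <- full[!duplicated(full$row.id), ]
result <- result[order(result$row.id), ]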

– Mysterio