I have two data frames, df.events and df.activ.
df.activ has very granular, minute-level data and an order of magnitude more records (1,000,000+) than df.events, which has ~100,000 records, also at minute-level granularity. The two data frames have two fields in common, DateTime and Geo. Both DateTime columns are POSIXlt in %Y-%m-%d %H:%M:%S format.
df.activ <- read.table(text=
'"DateTime","Geo","Bin1","Bin2"
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:12:00,510,0,1
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:13:00,618,1,1
2014-07-01 00:13:00,510,0,1
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0',header=TRUE,sep=",")
df.events <- read.table(text=
'"Units","Geo","DateTime"
225,999,2014-07-01 00:09:00
40,510,2014-07-01 00:12:00
5,999,2014-07-01 00:28:00
115,999,2014-07-01 00:44:00
0,999,2014-07-01 00:47:00',header=TRUE,sep=",")
My goal is to merge each row of df.activ with the row of df.events that has the nearest DateTime, whenever that df.events row's Geo value is 999.
If a df.events row's Geo is not 999, then I only want it merged onto df.activ rows whose Geo matches (e.g. the Geo = 510 cases in the data frames above).
I know for-loops aren't the right way to solve things in R, but conceptually I'm after a nested for-loop: walk down df.activ's DateTime field and bring on the df.events record with the closest DateTime whose Geo is either 999 or equal to that df.activ row's Geo.
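To make the rule concrete, here is that naive loop spelled out as runnable (if slow) base R, on a cut-down subset of the data above. The interpretation it encodes, namely that each df.activ row's candidates are the Geo = 999 events plus any events whose Geo matches exactly, with the nearest candidate winning, is my reading of the rule; the `idx` and `df.out` names are just for illustration:

```r
# Cut-down subset of the frames above, kept self-contained so it runs directly.
df.activ <- read.table(text = '"DateTime","Geo","Bin1","Bin2"
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:12:00,510,0,1
2014-07-01 00:13:00,618,1,1
2014-07-01 00:13:00,510,0,1', header = TRUE, sep = ",")
df.events <- read.table(text = '"Units","Geo","DateTime"
225,999,2014-07-01 00:09:00
40,510,2014-07-01 00:12:00
5,999,2014-07-01 00:28:00', header = TRUE, sep = ",")

df.activ$DateTime  <- as.POSIXct(df.activ$DateTime,  tz = "UTC")
df.events$DateTime <- as.POSIXct(df.events$DateTime, tz = "UTC")

# For each activity row, index of the chosen event row.
idx <- integer(nrow(df.activ))
for (i in seq_len(nrow(df.activ))) {
  geo  <- df.activ$Geo[i]
  # Candidate events: wildcard Geo == 999, plus exact Geo matches.
  cand <- which(df.events$Geo == 999 |
                (!is.na(geo) & df.events$Geo == geo))
  # Nearest candidate in absolute time difference wins.
  d    <- abs(as.numeric(df.events$DateTime[cand]) -
              as.numeric(df.activ$DateTime[i]))
  idx[i] <- cand[which.min(d)]
}
df.out <- cbind(df.activ,
                events.DateTime = df.events$DateTime[idx],
                events.Units    = df.events$Units[idx],
                events.Geo      = df.events$Geo[idx])
```

This reproduces the target rows for the subset (e.g. the 00:12 Geo = 510 activity picks up the Units = 40 event, while the Geo = 618 row falls back to the nearest 999 event), but an O(rows × events) loop is clearly not viable at 1M+ rows.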
The data frame below is what I'm after:
df.idealresults <- read.table(text=
'DateTime,Geo,Bin1,Bin2,events.DateTime,events.Units,events.Geo
7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,510,0,1,7/1/2014 0:12,40,510
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,618,1,1,7/1/2014 0:09,225,999
7/1/2014 0:13,510,0,1,7/1/2014 0:12,40,510
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999',header=TRUE,sep=',')
Thus far, I've been able to merge df.activ to the nearest DateTime in df.events. I accomplished this with an na.locf-based approach inspired by the latter part of the answer to this SO post. However, I've struggled to incorporate the Geo-matching logic: na.locf relies on vector-to-vector NA fills that are bound on before the merge step, which makes the conditional Geo match hard to express.
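One direction that might sidestep na.locf entirely, assuming the data.table package is acceptable: do two `roll = "nearest"` joins, one rolling the wildcard 999 events onto every activity row and one rolling exact-Geo events onto matching rows only, then prefer the exact match wherever one exists. This is only a sketch on a subset of the data, not benchmarked at the 1M-row scale:

```r
library(data.table)

# Subset of the frames above, rebuilt so the sketch is self-contained.
act <- data.table(
  DateTime = as.POSIXct(c("2014-07-01 00:11:00", "2014-07-01 00:12:00",
                          "2014-07-01 00:13:00", "2014-07-01 00:13:00"),
                        tz = "UTC"),
  Geo  = c(NA, 510L, 618L, 510L),
  Bin1 = c(0L, 0L, 1L, 0L),
  Bin2 = c(0L, 1L, 1L, 1L))
ev <- data.table(
  Units = c(225L, 40L, 5L),
  Geo   = c(999L, 510L, 999L),
  DateTime = as.POSIXct(c("2014-07-01 00:09:00", "2014-07-01 00:12:00",
                          "2014-07-01 00:28:00"), tz = "UTC"))
ev[, events.DateTime := DateTime]  # keep the event's own time after the join

# 1. Nearest wildcard (Geo == 999) event for every activity row.
w <- ev[Geo == 999][act, roll = "nearest", on = "DateTime"]
# 2. Nearest exact-Geo event; Units is NA where the activity Geo has no event.
g <- ev[Geo != 999][act, roll = "nearest", on = c("Geo", "DateTime")]

hit <- !is.na(g$Units)  # an exact-Geo event exists: prefer it
res <- data.table(act,
                  events.DateTime = fifelse(hit, g$events.DateTime,
                                                 w$events.DateTime),
                  events.Units    = fifelse(hit, g$Units, w$Units),
                  events.Geo      = fifelse(hit, g$Geo,   w$Geo))
```

On this subset `res` matches df.idealresults row for row, but I don't know whether the two-join split is the idiomatic way to express the 999 wildcard, or how it behaves at scale.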