Being new to R, I am looking for an efficient way to perform a loop with an analogue of VLOOKUP with two conditions. VLOOKUP allows to look up for a specific value throughout a column and apply it to each row of my data frame.
I have a long data.frame DF
of 3 variables:
Car
: identification number of the observed items (cars). Unique for each car, but not for each row.Date
: date of the observation, format="%Y-%m-%d"Area
: logic variable showing if an observation (Car
) on thisDate
was in a certain area (TRUE) or not (FALSE)
I need to create a new binary variable AreaChange
that shows if the Area
changed in the next 10 days for this Car
: if yes then 1, if no change then 0. I am also interested in one direction of change: from FALSE to TRUE.
It is possible that Area
changes several times in the next 10 days, if at least one of the changes is from FALSE to TRUE, the AreaChange
should equal 1.
It is also possible that some Car
s were observed for less than 10 days at certain periods, in these cases the AreaChange
calculation is also needed.
A sample dataset can look like:
set.seed(1)
DF <- data.frame(
Cars=as.integer(sample(127345:127346, 2000, replace=T)), #2 cars sample
Date=as.Date
(seq(from = as.Date("2015-12-21"), to=as.Date("2017-01-30"), length.out = 2000)),
Area=as.logical(sample(x=c(0,1), prob=c(.7, .3), size=2000, replace=T)))
DF <- DF[!duplicated(DF[,c("Cars","Date")]),] #795 observations
For me it looks as:
- Extracting 10
FutureArea
values for each row, matching on two parameters: sameCar
andDate
between (Date
andDate
+10). I suppose that it can be done in a loop format for the 10 days. - Creating the binary new variable
AreaChange
equaling 0 if all availableFutureArea
values are the same, or if the currentArea
for this row is TRUE.
I have found suggestions on cases with merging 2 data frames or for matching on just 1 condition or without extracting the Area
values on future days, but did not manage to combine them for my case.
For now, I have only managed to get the AreaChange, ignoring the need to match Car
and comparing the Area
only with the Area
in 10 days, not for every day in the next 10 days.
DF$Date10 <- DF$Date+10
library(expss)
DF$Area10 <- vlookup(DF$Date10, DF[,1:3], result_column = 3, lookup_column = 2)
DF$AreaChange10 <- ifelse(DF$Area10!=DF$Area & DF$Area==FALSE, 1, 0)
The desired output is the AreaChange
column, for instance as following:
- equals 1 if a switch of
Area
from FALSE to TRUE occurred between currentDate
andDate
+10 for the givenCar
, no matter what is the number of NA values during these days, - equals 0 otherwise.
Cars Date Area AreaDay0 AreaDay+1 AreaDay+2 AreaDay+3 AreaDay+4 AreaDay+5 AreaDay+6 AreaDay+7 AreaDay+8 AreaDay9 AreaDay+10 AreaChange Comment
127345 12/21/15 TRUE 1 0 0 0 1 1 0 0 NA 1 0 1 yes,_as_includes_switch_from_0_to_1
127346 12/21/15 TRUE 1 1 1 0 0 0 0 0 0 0 0 0 no,_as_the_switch_is_from_1_to_0
127347 12/22/15 FALSE 0 0 0 0 0 0 0 0 0 0 0 0 no,_as_no_switch
127348 12/22/15 FALSE 0 0 0 0 0 0 0 NA 1 0 0 1 yes,_as_includes_switch_from_0_to_1
127349 12/23/15 TRUE 1 1 1 1 1 1 NA 1 1 1 1 0 no,_as_no_switch
127350 12/21/15 FALSE 0 NA NA NA NA NA NA NA NA NA 1 1 yes,_as_includes_switch_from_0_to_1
Many thanks for any suggestions on how to optimize and proceed.