Being new to R, I am looking for an efficient way to loop over my data frame with an analogue of Excel's VLOOKUP that matches on two conditions. VLOOKUP looks up a specific value in a column, and I want to apply such a lookup to each row of my data frame.

I have a long data.frame DF of 3 variables:

  • Car: identification number of the observed items (cars). Unique for each car, but not for each row.
  • Date: date of the observation, format="%Y-%m-%d"
  • Area: logical variable showing whether the observed Car was in a certain area on this Date (TRUE) or not (FALSE)

I need to create a new binary variable AreaChange that shows whether the Area changed within the next 10 days for this Car: 1 if yes, 0 if there was no change. I am only interested in one direction of change: from FALSE to TRUE.

Area may change several times in the next 10 days; if at least one of the changes is from FALSE to TRUE, AreaChange should equal 1.

Some Cars may also have been observed on fewer than 10 days in certain periods; AreaChange still needs to be calculated in these cases.

A sample dataset can look like:

set.seed(1)
DF <- data.frame(
  Cars = as.integer(sample(127345:127346, 2000, replace = TRUE)), # 2 cars sample
  Date = as.Date(seq(from = as.Date("2015-12-21"), to = as.Date("2017-01-30"), length.out = 2000)),
  Area = as.logical(sample(x = c(0, 1), prob = c(.7, .3), size = 2000, replace = TRUE)))
DF <- DF[!duplicated(DF[, c("Cars", "Date")]), ] # 795 observations

As I see it, the task breaks down into:

  1. Extracting 10 FutureArea values for each row, matching on two parameters: same Car and Date between (Date and Date+10). I suppose that it can be done in a loop format for the 10 days.
  2. Creating the binary new variable AreaChange equaling 0 if all available FutureArea values are the same, or if the current Area for this row is TRUE.
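The two steps above can be sketched in plain base R, one row at a time. This is only a rough sketch, not a tuned solution. The helper names has_false_to_true() and add_area_change() and the toy data are made up for illustration. It assumes that the window includes both Date and Date+10, that NA values are skipped, and that a row is flagged whenever some FALSE is directly followed by a TRUE inside the window, even if the current Area is TRUE (this follows the desired-output table further down rather than the shortcut in step 2).

```r
# Hypothetical helper: TRUE if, after dropping NAs, some FALSE is directly followed by a TRUE
has_false_to_true <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 2) return(FALSE)
  any((!head(x, -1)) & tail(x, -1))
}

# Hypothetical helper: adds AreaChange by scanning each row's look-ahead window
add_area_change <- function(df, window = 10) {
  df$AreaChange <- vapply(seq_len(nrow(df)), function(i) {
    # Step 1: same car, Date between this row's Date and Date + window (inclusive)
    in_window <- df$Cars == df$Cars[i] &
      df$Date >= df$Date[i] &
      df$Date <= df$Date[i] + window
    # Order the matching observations by date before looking for a switch
    area <- df$Area[in_window][order(df$Date[in_window])]
    # Step 2: 1 if a FALSE-to-TRUE switch occurs in the window, 0 otherwise
    as.integer(has_false_to_true(area))
  }, integer(1))
  df
}

# Toy data, made up for illustration
toy <- data.frame(
  Cars = c(1L, 1L, 1L, 2L, 2L),
  Date = as.Date("2020-01-01") + c(0, 3, 20, 0, 5),
  Area = c(FALSE, TRUE, FALSE, TRUE, FALSE)
)
toy <- add_area_change(toy)
toy$AreaChange
```

Each row scans the whole data frame, so the cost grows quadratically with the number of rows; that is acceptable for a few hundred observations but not for a long data frame.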

I have found suggestions on cases with merging 2 data frames or for matching on just 1 condition or without extracting the Area values on future days, but did not manage to combine them for my case.

So far, I have only managed to compute AreaChange while ignoring the need to match on Car, and by comparing the Area only with the Area exactly 10 days later rather than on every day in the next 10 days.

DF$Date10 <- DF$Date+10
library(expss)
DF$Area10 <- vlookup(DF$Date10, DF[,1:3], result_column = 3, lookup_column = 2)
DF$AreaChange10 <- ifelse(DF$Area10!=DF$Area & DF$Area==FALSE, 1, 0)

The desired output is the AreaChange column, which:

  • equals 1 if a switch of Area from FALSE to TRUE occurred between the current Date and Date+10 for the given Car, regardless of the number of NA values during these days,
  • equals 0 otherwise.

For example:

Cars   Date     Area  AreaDay0 AreaDay+1 AreaDay+2 AreaDay+3 AreaDay+4 AreaDay+5 AreaDay+6 AreaDay+7 AreaDay+8 AreaDay+9 AreaDay+10 AreaChange Comment
127345 12/21/15 TRUE  1        0         0         0         1         1         0         0         NA        1         0          1          yes,_as_includes_switch_from_0_to_1
127346 12/21/15 TRUE  1        1         1         0         0         0         0         0         0         0         0          0          no,_as_the_switch_is_from_1_to_0
127347 12/22/15 FALSE 0        0         0         0         0         0         0         0         0         0         0          0          no,_as_no_switch
127348 12/22/15 FALSE 0        0         0         0         0         0         0         NA        1         0         0          1          yes,_as_includes_switch_from_0_to_1
127349 12/23/15 TRUE  1        1         1         1         1         1         NA        1         1         1         1          0          no,_as_no_switch
127350 12/21/15 FALSE 0        NA        NA        NA        NA        NA        NA        NA        NA        NA        1          1          yes,_as_includes_switch_from_0_to_1

Many thanks for any suggestions on how to optimize and proceed.

NLavins
    Welcome to SO, NLavins! Please make this question *reproducible*. This includes sample code (including listing non-base R packages), sample *unambiguous* data (e.g., `dput(head(x))` or `data.frame(x=...,y=...)`), and expected output. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. – r2evans Nov 21 '19 at 06:07
  • Thank you for advice with explanation and links. It definitely makes the case more clear. I have added a code part to generate a sample dataset, as well as an example of the logic of the needed output variable. – NLavins Nov 21 '19 at 11:55

1 Answer

This one is hard to answer without sample data, so please add some to your question (see the comment from @r2evans) if the code below does not work for you.

The solution below uses the data.table package.

First, since none was provided, I made up some sample data based on the description in your question. It is named dt.

library( data.table )
#build sample data
dt <- data.table( Car = 1, 
                  Date = seq( as.Date( "2019-01-01"), by = "1 days", length.out = 300 ),
                  Area = rep( rep( c(TRUE, FALSE), each = 75 ), 2 ) )

Then, I created a separate lookup table with all Car + Date values, and I added an End_Date of Date + 30 days to each row. The table is named dt.lookup.

#create lookup-table  based on dt$Date with end of lookup-period
dt.lookup <- copy(dt)[, Area := NULL ]
dt.lookup[, End_Date := Date + 30 ]

Then, I used a data.table non-equi join to find all observations in dt that fall within the periods defined in dt.lookup. I wrote the result to a new data.table, ans. Of course, you'll get many more rows than you started with, so I set allow.cartesian = TRUE, which permits a join result that is larger than its inputs.

#perform non-equi join to find all Area values within the period
ans <- dt[ dt.lookup, on = .( Car, Date >= Date, Date < End_Date ), allow.cartesian = TRUE ]
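To see why the join returns more rows than it started with, here is a minimal sketch on made-up toy data with a 2-day period instead of 30 days. The names x, windows and joined are only for illustration. As far as I know, the Date column of the joined result holds the start of each lookup period (the value from the lookup table), not the date of the matched observation, which is what makes the later grouping by Car and Date work.

```r
library(data.table)

# Toy data (made up for illustration): one car, four consecutive days
x <- data.table(Car = 1L,
                Date = as.Date("2019-01-01") + 0:3,
                Area = c(TRUE, FALSE, FALSE, TRUE))

# One lookup period per row: the day itself and the next day
windows <- x[, .(Car, Date, End_Date = Date + 2)]

# Same join pattern as above: every observation inside each period
joined <- x[windows, on = .(Car, Date >= Date, Date < End_Date), allow.cartesian = TRUE]

nrow(x)                           # rows before the join
nrow(joined)                      # rows after the join: one per (period, observation) pair
joined[, .N, by = .(Car, Date)]   # number of observations found in each period
```

The last period is shorter than the others because no later observations exist, so cars that were observed for fewer days are handled without special treatment.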

After joining, summarise by Car and Date to find the number of unique values of Area within that 30-day period. If this number equals 1, there has been no change. If it equals 2, both TRUE and FALSE occurred as Area values in the period.
Now it's easy to find the periods with an Area_Change!

#summarise to the number of unique Area values per 30-day period
ans2 <- ans[ , .(total = uniqueN(Area)), by = .( Car, Date )]
#fill column Area_Change
ans2[total == 1, Area_Change := 0 ]
ans2[total == 2, Area_Change := 1 ]
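A quick check of what uniqueN() counts, in case the real data contain NA in Area. This is a side note on made-up vectors, not part of the pipeline above; the na.rm argument is part of data.table's uniqueN().

```r
library(data.table)

uniqueN(c(FALSE, FALSE, FALSE))            # one distinct value: no change
uniqueN(c(FALSE, TRUE, FALSE))             # two distinct values: a change
uniqueN(c(TRUE, NA, TRUE))                 # NA counts as a distinct value by default
uniqueN(c(TRUE, NA, TRUE), na.rm = TRUE)   # NA ignored
uniqueN(c(TRUE, FALSE, NA))                # three distinct values
```

With NA present, a period can therefore be flagged as changed although Area never switched, and a total of 3 is covered by neither of the two assignments above. Using uniqueN(Area, na.rm = TRUE) in the summary step is one way around this.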

All we have to do now is add our newly found periods with Area_Changes back into our original dt.

#update_join the results back to the original dt
dt[ ans2, Area_Change := i.Area_Change, on = .( Car, Date )]

The code above can be shortened quite a bit, but since you are new to SO, I assume you are also pretty new to R. This way, you can easily check and verify all intermediate results.
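Counting unique values cannot tell a FALSE-to-TRUE switch from a TRUE-to-FALSE one. A possible variation for flagging only FALSE-to-TRUE is sketched below; treat it as an untested-on-your-data sketch under a few assumptions. The helper has_false_to_true() and the extra column Obs_Date are made up for illustration, NA values are skipped, the period includes its end date, and the toy data use a 2-day period to keep the example short.

```r
library(data.table)

# Hypothetical helper: TRUE if some FALSE is directly followed by a TRUE (NAs dropped)
has_false_to_true <- function(x) {
  x <- x[!is.na(x)]
  length(x) > 1 && any((!x[-length(x)]) & x[-1])
}

# Toy data (made up): one car, six consecutive days
dt <- data.table(Car = 1L,
                 Date = as.Date("2019-01-01") + 0:5,
                 Area = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE))

# Keep a copy of the observation date: after the join, Date holds the period start
obs <- copy(dt)[, Obs_Date := Date]
dt.lookup <- dt[, .(Car, Date, End_Date = Date + 2)]

# Same non-equi join as before, then sort the observations inside each period by date
ans <- obs[dt.lookup, on = .(Car, Date >= Date, Date <= End_Date), allow.cartesian = TRUE]
setorder(ans, Car, Date, Obs_Date)

# One flag per period: 1 only if a FALSE is directly followed by a TRUE
ans2 <- ans[, .(Area_Change = as.integer(has_false_to_true(Area))), by = .(Car, Date)]

# Update-join the flags back to the original data
dt[ans2, Area_Change := i.Area_Change, on = .(Car, Date)]
dt$Area_Change
```

Only the summary step differs from the code above: the uniqueN() count is replaced by a check on consecutive values, which is why the observations have to be put in date order first.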

Wimpel
  • Thank you so much for suggesting an approach, very useful and much appreciated. I have edited the question: (1) adding a sample dataset, which is basically like the one you created, (2) for simplicity, reducing the time period from 30 days to 10 days, which doesn't actually matter, (3) giving a (clumsy) sample table of the "logic" of the output. Thanks a lot, the code works on the original dataset. The only thing I am still struggling with is reflecting only the cases with `AreaChange` from FALSE to TRUE. Do you by any chance have a suggestion on how to include it? – NLavins Nov 21 '19 at 19:35