I have a data table, structured as below, in which I have kept track of processes. If an event occurred for a process on a given day I marked a 1 next to it, otherwise 0. I have shown the first few rows here, but the real dataset has many rows (over 500,000), with many unique process IDs.
process_id date event
00001 01/01/12 0
00002 01/01/12 1
00003 01/01/12 0
... ... ...
00001 01/01/19 1
00002 01/01/19 0
00003 01/01/19 1
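For reproducibility, here is a small sample in the same shape. I am assuming lubridate for date parsing, and that LastYearDate (referenced in the loop further down) is a precomputed column holding the start of each row's one-year look-back window:

library(dplyr)
library(lubridate)

# Minimal sample in the same shape as the real table
data <- tibble(
  process_id = rep(sprintf("%05d", 1:3), times = 2),
  date       = dmy(c("01/01/12", "01/01/12", "01/01/12",
                     "01/01/19", "01/01/19", "01/01/19")),
  event      = c(0, 1, 0, 1, 0, 1)
) %>%
  # start of the one-year look-back window used in the loop below
  mutate(LastYearDate = date - years(1))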
What I would like to know is, for each observation (row), whether an event occurred in the preceding year (not including the current date) for that process_id, and to add a column holding that flag. Suppose that the row
00002 10/01/18 1
occurred in the table; then the output table might look like
process_id date event previousEvent
00001 01/01/12 0 NA
00002 01/01/12 1 NA
00003 01/01/12 0 NA
... ... ... ...
00001 01/01/19 1 0
00002 01/01/19 0 1
00003 01/01/19 1 0
My current way of doing this is by filtering with the dplyr toolkit, but I assume that, since it is not a vectorized approach, it may not be the most efficient way of doing things. Using the doSNOW package for a parallelized approach, the main loop of the program looks like the following. It simply counts how many times the event occurred in the window to determine whether the event happened in the last year. However, even this approach takes a very long time (about an hour for this many rows on my machine):
library(dplyr)
library(foreach)
library(doSNOW)
# (cluster setup, registerDoSNOW(), and opts omitted here)

result <- foreach(i = 1:nrow(data),
                  .options.snow = opts, .combine = 'rbind',
                  .packages = 'dplyr') %dopar% {
  # count events for this process_id in [LastYearDate, date)
  d <- nrow(data %>%
              filter(process_id %in% data[i, ]$process_id) %>%
              filter(date >= data[i, ]$LastYearDate) %>%
              filter(date < data[i, ]$date) %>%
              filter(event > 0))
  return(ifelse(d > 0, 1, 0))
}
Could there be a better approach? I am pretty new to R and the many techniques for filtering tables. For what it's worth, I suspect a vectorized non-equi join might be the way to go; a rough sketch of what I mean is below, though I am not sure it is correct or idiomatic.
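A minimal sketch, assuming data.table and reusing the precomputed LastYearDate column from above (I have not reproduced the NA values for rows without a full year of history):

library(data.table)

dt <- as.data.table(data)

# Window bounds for each row: [date - 1 year, date)
dt[, `:=`(win_lo = LastYearDate, win_hi = date)]

# Only rows with an event can contribute to another row's flag;
# `hit` is a marker so that unmatched rows sum to 0 below
ev <- dt[event == 1, .(process_id, ev_date = date, hit = 1L)]

# Non-equi self-join: for each row of dt, count events for the
# same process_id whose date falls inside that row's window
prior <- ev[dt,
            on = .(process_id, ev_date >= win_lo, ev_date < win_hi),
            .(n_prior = sum(hit, na.rm = TRUE)),
            by = .EACHI]

dt[, previousEvent := as.integer(prior$n_prior > 0)]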