I have a data table, structured as below, in which I have kept track of processes. If an event occurred for a process on a given day I marked a 1 next to it, otherwise 0. I have shown the first few rows here, but the real dataset has many rows (over 500,000), with many unique process IDs.
process_id date event
00001 01/01/12 0
00002 01/01/12 1
00003 01/01/12 0
... ... ...
00001 01/01/19 1
00002 01/01/19 0
00003 01/01/19 1
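For reproducibility, here is a small sample in the same shape. I am assuming lubridate for date parsing, and that LastYearDate (referenced in the loop further down) is a precomputed column holding the start of each row's one-year look-back window:

library(dplyr)
library(lubridate)

# Minimal sample in the same shape as the real table
data <- tibble(
  process_id = rep(sprintf("%05d", 1:3), times = 2),
  date       = dmy(c("01/01/12", "01/01/12", "01/01/12",
                     "01/01/19", "01/01/19", "01/01/19")),
  event      = c(0, 1, 0, 1, 0, 1)
) %>%
  # start of the one-year look-back window used in the loop below
  mutate(LastYearDate = date - years(1))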
What I would like to know is, for each observation (row), whether an event occurred in the preceding year (not including the current date) for that process_id, and to add a column holding that flag. Suppose that the row
00002 10/01/18 1
occurred in the table; then the output table might look like
process_id date event previousEvent
00001 01/01/12 0 NA
00002 01/01/12 1 NA
00003 01/01/12 0 NA
... ... ... ...
00001 01/01/19 1 0
00002 01/01/19 0 1
00003 01/01/19 1 0
My current way of doing this is by filtering with the dplyr toolkit, but I assume that, since it is not a vectorized approach, it may not be the most efficient way of doing things. Using the doSNOW package for a parallelized approach, the main loop of the program looks like the following. It simply counts how many times the event occurred in the window to determine whether the event happened in the last year. However, even this approach takes a very long time (about an hour for this many rows on my machine):
library(dplyr)
library(foreach)
library(doSNOW)
# (cluster setup, registerDoSNOW(), and opts omitted here)

result <- foreach(i = 1:nrow(data),
                  .options.snow = opts, .combine = 'rbind',
                  .packages = 'dplyr') %dopar% {
  # count events for this process_id in [LastYearDate, date)
  d <- nrow(data %>%
              filter(process_id %in% data[i, ]$process_id) %>%
              filter(date >= data[i, ]$LastYearDate) %>%
              filter(date < data[i, ]$date) %>%
              filter(event > 0))
  return(ifelse(d > 0, 1, 0))
}
Could there be a better approach? I am pretty new to R and the many techniques for filtering tables. For what it's worth, I suspect a vectorized non-equi join might be the way to go; a rough sketch of what I mean is below, though I am not sure it is correct or idiomatic.
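A minimal sketch, assuming data.table and reusing the precomputed LastYearDate column from above (I have not reproduced the NA values for rows without a full year of history):

library(data.table)

dt <- as.data.table(data)

# Window bounds for each row: [date - 1 year, date)
dt[, `:=`(win_lo = LastYearDate, win_hi = date)]

# Only rows with an event can contribute to another row's flag;
# `hit` is a marker so that unmatched rows sum to 0 below
ev <- dt[event == 1, .(process_id, ev_date = date, hit = 1L)]

# Non-equi self-join: for each row of dt, count events for the
# same process_id whose date falls inside that row's window
prior <- ev[dt,
            on = .(process_id, ev_date >= win_lo, ev_date < win_hi),
            .(n_prior = sum(hit, na.rm = TRUE)),
            by = .EACHI]

dt[, previousEvent := as.integer(prior$n_prior > 0)]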