
For a panel dataset (GSOEP), I need to create a time counter that gives me delta t after an event, which is dummy coded 1 in the year it occurs for each individual. For example, an individual may be observed over an arbitrary range of years such as 1990-2006, with a separate variable indicating 1 for a certain event in, say, 1996. The counter needs to start in the following year, must reset with the next individual (id), and needs to be zero before the event for that individual occurs.

Currently the data looks like this:

df <- data.frame(id= rep(c("1","2","3"), each=6), year=rep(1998:2003, times=3), event=c(0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0), stringsAsFactors=FALSE)

   id year event
1   1 1998     0
2   1 1999     0
3   1 2000     1
4   1 2001     0
5   1 2002     0
6   1 2003     0
7   2 1998     0
8   2 1999     0
9   2 2000     0
10  2 2001     0
11  2 2002     1
12  2 2003     0
13  3 1998     0
14  3 1999     1
15  3 2000     0
16  3 2001     0
17  3 2002     0
18  3 2003     0

What is needed is this:

df <- data.frame(id= rep(c("1","2","3"), each=6), year=rep(1998:2003, times=3), event=c(0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0),delta=c(0,0,0,1,2,3,0,0,0,0,0,1,0,0,1,2,3,4), stringsAsFactors=FALSE)

   id year event delta
1   1 1998     0     0
2   1 1999     0     0
3   1 2000     1     0
4   1 2001     0     1
5   1 2002     0     2
6   1 2003     0     3
7   2 1998     0     0
8   2 1999     0     0
9   2 2000     0     0
10  2 2001     0     0
11  2 2002     1     0
12  2 2003     0     1
13  3 1998     0     0
14  3 1999     1     0
15  3 2000     0     1
16  3 2001     0     2
17  3 2002     0     3
18  3 2003     0     4

How can I achieve this? The closest I got was this: Create sequential counter that restarts on a condition within panel data groups

But I do not know how to modify it so that the counter only starts after the event has occurred and stays at zero before it. Also, there are some individuals for whom no event occurs; for these the counter needs to output zeros throughout. The number of years (observations) differs between individuals, so some ids range from 1984-1999 while others range from 1995-2015.
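
For illustration, such a no-event individual could be appended to the example data like this (id "4" is purely a hypothetical test case):

df_test <- rbind(df, data.frame(id = "4", year = 1998:2003,
                                event = 0, stringsAsFactors = FALSE))
# for id 4, delta should be 0 in every year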

You would be helping me out enormously; thank you in advance for your time and effort.

Best Regards,

Julius


2 Answers

You can use group_by(id) and cumsum(cummax(event)) to get close: it produces the sequence 1...N starting in the year where event == 1. I wrap it in ifelse(...) to subtract 1 from the values that are greater than 0, so the counter is 0 in the event year and starts counting the year after.
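
To see the mechanics on a single id's event vector:

x <- c(0, 0, 1, 0, 0, 0)   # id 1's event column
cummax(x)                  # 0 0 1 1 1 1  (latches to 1 from the event year on)
cumsum(cummax(x))          # 0 0 1 2 3 4  (counts up from the event year)
# subtracting 1 where > 0 gives 0 0 0 1 2 3, the desired delta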

library(tidyverse)
df %>%
  group_by(id) %>%
  mutate(delta = ifelse(cumsum(cummax(event)) > 0, cumsum(cummax(event)) - 1, 0)) %>%
  ungroup()

# A tibble: 18 x 4
   # id     year event delta
   # <chr> <int> <dbl> <dbl>
 # 1 1      1998    0.    0.
 # 2 1      1999    0.    0.
 # 3 1      2000    1.    0.
 # 4 1      2001    0.    1.
 # 5 1      2002    0.    2.
 # 6 1      2003    0.    3.
 # 7 2      1998    0.    0.
 # 8 2      1999    0.    0.
 # 9 2      2000    0.    0.
# 10 2      2001    0.    0.
# 11 2      2002    1.    0.
# 12 2      2003    0.    1.
# 13 3      1998    0.    0.
# 14 3      1999    1.    0.
# 15 3      2000    0.    1.
# 16 3      2001    0.    2.
# 17 3      2002    0.    3.
# 18 3      2003    0.    4.
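
An equivalent variant (my rewording, not the code Julius accepted) computes the cumulative sum only once and clamps negative values with pmax:

df %>%
  group_by(id) %>%
  mutate(delta = pmax(cumsum(cummax(event)) - 1, 0)) %>%
  ungroup()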
CPak
  • Thank you for your response, that's the direction I wanted to go. However, when I run it on my data the vector delta is NULL and does not seem to be created, yet there is no error, even though I eliminated all NAs beforehand. Do you have an idea what the problem might be? – Julius Mar 07 '18 at 14:36
  • Didn't assign a data frame, my bad. Thanks again for the helpful answer! – Julius Mar 07 '18 at 18:24
  • I think this answer is very good since it is very concise, so no critique to CPak, nice approach (+1)! Just wanted to note towards @Julius that my answer produces the same results and is 4 times faster, see the benchmarking I added in my answer below. I would kindly ask you to specify your requirements in future questions. You had not specified that you will only accept `dplyr` / `tidyverse` solutions, which, again, is totally fine, just please specify such requirements more clearly to spare the time of others proposing different approaches. Thank you. – Manuel Bickel Mar 07 '18 at 19:23
  • Dear Manuel, the thing was that I couldn't get your solution working, as it produced a counter that did not restart for new IDs, and it wasn't as easy for me to handle as an obvious R beginner (my fault). For that reason I completely believe that your answer is at least equally good, yet I took CPak's as it was the most convenient from my point of view; I'm sorry that I cannot choose both answers at the same time. Thanks again! – Julius Mar 08 '18 at 09:12
  • @Julius Alright, thanks for your feedback and sorry for being so picky. I agree that for beginners, packages like dplyr often hold very good ready-made, concise and understandable solutions; self-made solutions like mine may only be of interest for advanced speed and functionality requirements. I wish you success in learning R! – Manuel Bickel Mar 11 '18 at 18:10

Maybe not the most elegant version, but if your data set is not too large, the following lines might be a start.

library(data.table)
df <- data.frame(id= rep(c("1","2","3"), each=6), year=rep(1998:2003, times=3), event=c(0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0), stringsAsFactors=FALSE)
DT <- as.data.table(df)

get_delta <- function(x) {
  # x is the 0/1 event vector of one id (assumes at most one event per id)
  if (all(x == 0)) {
    # no event for this id: delta stays 0 throughout
    return(x)
  } else {
    event_position <- which(x == 1)
    x[event_position] <- 0  # delta is 0 in the event year itself
    if (event_position == length(x)) {
      # event in the last observed year: nothing left to count
      return(x)
    } else {
      # count 1, 2, ... for the years after the event
      x[(event_position + 1):length(x)] <- seq(length(x) - event_position)
      return(x)
    }
  }
}


DT[, delta:= get_delta(event), by = c("id")]
DT
# id year event delta
# 1:  1 1998     0     0
# 2:  1 1999     0     0
# 3:  1 2000     1     0
# 4:  1 2001     0     1
# 5:  1 2002     0     2
# 6:  1 2003     0     3
# 7:  2 1998     0     0
# 8:  2 1999     0     0
# 9:  2 2000     0     0
# 10:  2 2001     0     0
# 11:  2 2002     1     0
# 12:  2 2003     0     1
# 13:  3 1998     0     0
# 14:  3 1999     1     0
# 15:  3 2000     0     1
# 16:  3 2001     0     2
# 17:  3 2002     0     3
# 18:  3 2003     0     4

n_rows <- 1e6
DT_large <- data.table(id = as.character(rep(1:n_rows, each = 6)),
                       year = rep(1998:2003, n_rows),
                       event = as.vector(sapply(1:n_rows, function(x) {
                         x <- rep(0, 6)
                         x[sample(6, 1)] <- 1
                         x
                       })),
                       stringsAsFactors = FALSE)

system.time(DT_large[, delta:= get_delta(event), by = c("id")])
#    user  system elapsed
#    9.30    0.02    9.35

#some benchmarking...
library(tidyverse)
library(data.table)
library(microbenchmark)

df <- data.frame(id= rep(c("1","2","3"), each=6), year=rep(1998:2003, times=3), event=c(0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0), stringsAsFactors=FALSE)

dplyr_approach <- function() {
  df %>%
    group_by(id) %>%
    mutate(delta = ifelse(cumsum(cummax(event)) > 0, cumsum(cummax(event)) - 1, 0)) %>%
    ungroup()  
}

manuelbickel_approach <- function() {
  DT <- as.data.table(df)
  get_delta <- function(x) {
    if (all(x == 0)) {
      return(x)
    } else {
      event_position <- which(x == 1)
      x[event_position] <- 0
      if (event_position == length(x)) {
        return(x) 
      } else {
        x[(event_position+1):length(x)] <- seq(length(x)-event_position)
        return(x)
      }
    }
  }
  DT[, delta:= get_delta(event), by = c("id")]
}


microbenchmark(
  (dplyr_approach()),
  (manuelbickel_approach())
)

# Unit: microseconds
#       expr                      min        lq     mean   median       uq       max neval
# (dplyr_approach())         3731.146 3872.6625 4098.923 3985.363 4194.183  6441.475   100
# (manuelbickel_approach())   803.705  829.5605 1148.891 1014.105 1049.829 13993.372   100
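
As a quick sanity check (my addition, not part of the original benchmark), one can verify that both approaches produce the same counter on the example data:

res_dplyr <- dplyr_approach()
res_dt    <- manuelbickel_approach()
all(res_dplyr$delta == res_dt$delta)  # should be TRUE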
Manuel Bickel
  • Thank you for your quick proposal! Sadly it blows up my RAM as the vector gets too big; I will keep trying to find a solution with dplyr. – Julius Mar 07 '18 at 11:25
  • How large is your data, actually? I have updated the example and used a DT with 10e6 rows; this works for me. (I have also updated the function slightly.) Another thing: your programming problem is not primarily a matter of competition between `dplyr` and `data.table`. – Manuel Bickel Mar 07 '18 at 12:58