7

I have a panel data set for which I would like to create a counter that increases with each step in the panel but restarts whenever some condition occurs. In my case, I'm using country-year data and want to count the passage of years between an event. Here's a toy data set with the key features of my real one:

df <- data.frame(country = rep(c("A","B"), each=5), year=rep(2000:2004, times=2), event=c(0,0,1,0,0,1,0,0,1,0), stringsAsFactors=FALSE)

What I'm looking to do is to create a counter that is keyed to df$event within each country's series of observations. The clock starts at 1 when we start observing each country; it increases by 1 with the passage of each year; and it restarts at 1 whenever df$event==1. The desired output is this:

   country year event clock
1        A 2000     0     1
2        A 2001     0     2
3        A 2002     1     1
4        A 2003     0     2
5        A 2004     0     3
6        B 2000     1     1
7        B 2001     0     2
8        B 2002     0     3
9        B 2003     1     1
10       B 2004     0     2

I have tried using getanID from splitstackshape and a few variations of if and ifelse but have failed so far to get the desired result.

I'm already using dplyr in the scripts where I need to do this, so I would prefer a solution that uses it or base R, but I would be grateful for anything that works. My data sets are not massive, so speed is not critical, but efficiency is always a plus.

ulfelder
  • 5,305
  • 1
  • 22
  • 40
  • This question isn't quite a duplicate of the other. The other asks about making a simple counter for all rows within each group. This question asks about counting steps within groups until some condition occurs, then restarting the counter if and when the condition does occur, and doing that repeatedly. This is particularly useful for event history data but could be useful in other contexts, too, and the simpler prior version won't do it. – ulfelder Apr 30 '16 at 09:24

2 Answers2

21

With dplyr that would be:

df %>% 
  group_by(country, idx = cumsum(event == 1L)) %>% 
  mutate(counter = row_number()) %>% 
  ungroup %>% 
  select(-idx)

#Source: local data frame [10 x 4]
#
#   country year event counter
#1        A 2000     0       1
#2        A 2001     0       2
#3        A 2002     1       1
#4        A 2003     0       2
#5        A 2004     0       3
#6        B 2000     1       1
#7        B 2001     0       2
#8        B 2002     0       3
#9        B 2003     1       1
#10       B 2004     0       2

Or using data.table:

library(data.table)
setDT(df)[, counter := seq_len(.N), by = list(country, cumsum(event == 1L))]

Edit: group_by(country, idx = cumsum(event == 1L)) is used to group by country and a new grouping index "idx". The event == 1L part creates a logical index telling us whether the column "event" is an integer 1 or not (TRUE/FALSE). Then, cumsum(...) sums up starting from 0 for the first 2 rows, 1 for the next 3, 2 for the next 3 and so on. We use this new column (+ country) to group the data as needed. You can check it out if you remove the last to pipe-parts in the dplyr code.

talat
  • 68,970
  • 21
  • 126
  • 157
4
library(splitstackshape)
df$counter <- getanID(cbind(df$country, cumsum(df$event)))[,.id]

We take advantage of the fact that you already have zeroes and ones in your event column. That makes indexing much easier. I combine the country column with cumsum(df$event). When that command is run by itself you can see its effect:

cumsum(df$event)
 [1] 0 0 1 1 1 2 2 2 3 3

It will only increase with each 1 value. When combined with the country, we are able to see the increase grouped by country.

From there, we can create an id column. @AnandaMahto's splitstackshape package has the function getanID for that.

 df
   country year event counter
1        A 2000     0       1
2        A 2001     0       2
3        A 2002     1       1
4        A 2003     0       2
5        A 2004     0       3
6        B 2000     1       1
7        B 2001     0       2
8        B 2002     0       3
9        B 2003     1       1
10       B 2004     0       2
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • Ah, got it, thank you. I had tried using `getanID` with `country` and `event` as the grouping variables but hadn't thought of using `cumsum()` to define the tiers within countries instead. Cool. – ulfelder Aug 27 '15 at 13:39