3

I am currently using R to process a data set that looks like the following:

age  ep  
 1    0   
 2    0   
 3    1   
 4    1   
 5    1   
 6    1   
 7    0   
 8    0   
 9    1   
10    1   
11    0   

I want to create a variable that will keep track of the first occurrence of ep=1 per series of ep=1. These series will have ep=0 prior to the first ep=1 and ep=0 following the last ep=1 of each series.

I would like the data set to look like this after processing:

 age  ep  first
 1    0   NA
 2    0   NA
 3    1   1
 4    1   NA
 5    1   NA
 6    1   NA
 7    0   NA
 8    0   NA
 9    1   1
10    1   NA
11    0   NA

I am working in data table as this data set is rather large, so I'd prefer to process the data using code for data tables, however if this isn't possible I can convert to a data frame and use other code. Any assistance would be greatly appreciated.

markus
  • 25,843
  • 5
  • 39
  • 58
bziggy
  • 463
  • 5
  • 19

4 Answers4

8

A fast data.table method ...

library(data.table)

dt <- fread("age  ep  
 1    0   
 2    0   
 3    1   
 4    1   
 5    1   
 6    1   
 7    0   
 8    0   
 9    1   
10    1   
11    0")

dt[!shift(ep) & ep, first := 1]

# or more explicit:
dt[shift(ep) != 1 & ep == 1, first := 1]

dt
#     age ep first
#  1:   1  0    NA
#  2:   2  0    NA
#  3:   3  1     1
#  4:   4  1    NA
#  5:   5  1    NA
#  6:   6  1    NA
#  7:   7  0    NA
#  8:   8  0    NA
#  9:   9  1     1
# 10:  10  1    NA
# 11:  11  0    NA

Note: just for clarity, if your object is not already a data.table. You can coerce it to a data.table:

setDT(dt)
Henrik
  • 65,555
  • 14
  • 143
  • 159
Khaynes
  • 1,976
  • 2
  • 15
  • 27
3

Another option using an update join

dt[, first := dt[dt[, .I[1], by=rleid(ep)]$V1][ep == 1][dt, on=.(age), ep]]
dt
#    age ep first
# 1:   1  0    NA
# 2:   2  0    NA
# 3:   3  1     1
# 4:   4  1    NA
# 5:   5  1    NA
# 6:   6  1    NA
# 7:   7  0    NA
# 8:   8  0    NA
# 9:   9  1     1
#10:  10  1    NA
#11:  11  0    NA

Using data provided by @Khaynes

markus
  • 25,843
  • 5
  • 39
  • 58
3

An approach using fifelse

dt[, first := fifelse( ep == 1 & shift( ep , type = "lag" ) == 0L, 1L, NA_integer_) ]
dt
#    age ep first
# 1:   1  0    NA
# 2:   2  0    NA
# 3:   3  1     1
# 4:   4  1    NA
# 5:   5  1    NA
# 6:   6  1    NA
# 7:   7  0    NA
# 8:   8  0    NA
# 9:   9  1     1
# 10:  10  1    NA
# 11:  11  0    NA
SymbolixAU
  • 25,502
  • 4
  • 67
  • 139
3

Another update join version, using mult="first" to only overwrite the first matching row in the group:

dt[, rid := rleid(ep)][dt[ep==1], on=.(rid), mult="first", first := 1]
dt
#    age ep rid first
# 1:   1  0   1    NA
# 2:   2  0   1    NA
# 3:   3  1   2     1
# 4:   4  1   2    NA
# 5:   5  1   2    NA
# 6:   6  1   2    NA
# 7:   7  0   3    NA
# 8:   8  0   3    NA
# 9:   9  1   4     1
#10:  10  1   4    NA
#11:  11  0   5    NA
thelatemail
  • 91,185
  • 12
  • 128
  • 188