3

Suppose I have a data frame that looks like this:

ID   T  X  Y  Z
1    1  A  A  NA
1    2  B  A  NA
1    3  B  B  NA
1    4  B  A  NA
2    1  A  B  NA
2    2  A  A  NA
2    3  B  A  NA
2    4  A  B  NA
3    1  B  B  NA
3    2  B  B  NA
3    3  B  B  NA
3    4  B  A  NA

And I would like to replace the value of Z based on some conditionals that depend on both row and (previous) column values so that the above ends up looking like this:

ID   T  X  Y  Z
1    1  A  A  0
1    2  B  A  0
1    3  B  B  1
1    4  B  A  NA
2    1  A  B  0
2    2  A  A  0
2    3  B  A  0
2    4  A  B  0
3    1  B  B  1
3    2  B  B  NA
3    3  B  B  NA
3    4  B  A  NA

The rules:

  1. Z takes the value of 1 the first time (in order by T, and within an ID) that both X and Y one that row have the value B.
  2. Z takes (or retains) the value NA if and only if for any smaller value of T, it has taken the value of 1 already.
  3. When T = 1, Z takes the value of 0 if X and Y on that row do not both equal B.
  4. When T > 1, Z takes the value of 0 if X and Y on that row do not both equal B, AND the value of Z on the previous row = zero.

I want the following to work, and it gets me kinda close but no dice:

df$Z <- NA
for (t in 1:4) {
  df$Z[ (df$X=="B" & df$Y=="B") & df$T==1] <- 1
  df$Z[!(df$X=="B" & df$Y=="B") & df$T==1] <- 0
  if (t>1) {
    df$Z[ (df$X=="B" & df$Y=="B") & df$T==t & (!is.na(df$Z[t-1]) & df$Z[t-1]==0)] <- 0
    df$Z[!(df$X=="B" & df$Y=="B") & df$T==t & (!is.na(df$Z[t-1]) & df$Z[t-1]==0)] <- 1
  }
}

On the other hand, I can write series of nested if... then statements looping across all observations, but that is excruciatingly slow (at least, compared to the program I am translating from on Stata).

I am sure I am committing twelve kinds of gaffes in my attempt above, but a few hours of banging my head on this has not resolved it.

So I come to you begging, hat in hand. :)

Edit: it occurs to me that sharing the Stata code (which resolves this so much faster than what I have come up with in R, which is ironic, given my preference for R over Stata's language :) might help with suggestions. This does what I want, and does it fast (even with, say, N=1600, T=11):

replace Z = .
forvalues t = 1(1)4 {
  replace Z = 1 if X == "B" & Y == "B" & T == 1
  replace Z = 0 if X == "B" & Y == "B" & T == 1
  replace Z = 1 if X == "B" & Y == "B" & T == `t' & Z[_n-1] == 0 & `t' > 1
  replace Z = 0 if X == "B" & Y == "B" & T == `t' & Z[_n-1] == 0 & `t' > 1
  }
Alexis
  • 784
  • 8
  • 32
  • For grouping by ID have a look at ddply in the plyr package. – DavidC Dec 25 '13 at 05:23
  • Also you'll want to know about the function "ifelse". – DavidC Dec 25 '13 at 05:25
  • Just out of curiosity, what sort of data needs this kind of manipulation? I mean what field are you in.. Very nicely explained post. Have a +1 :). – Arun Dec 25 '13 at 09:16
  • Also it'd be great if you could tell a bit about the real dimensions of your data. Is it just these many rows? Or this is just a sample data? How big a data you deal with usually? – Arun Dec 25 '13 at 09:21
  • Thank you all (@DavidC: I am tempted to give you a friendly tweak, of course I know about ifelse! Of course, I didn't explicity say so, so I will humbly take *your* tweak ;). – Alexis Dec 25 '13 at 17:25
  • @Arun: Thank you very much. What I am providing a cartoon example of are data for *discrete time event history models* (e.g. for a logit hazard model of event occurrence, or some other binomial family hazard). My field is public health (I am an epidemiologist), but these kinds of models are all over the place (sometimes called *time before failure* or *survival* models). The scale is a little bit wider (i.e. the covariates in a multiple regression context), and much longer (observations in the thousands, across far more time periods, and perhaps nested in other hierarchical orders as well). – Alexis Dec 25 '13 at 17:30
  • @Arun: and, of course, I am simplifying my example a bit so as to keep focus on my programming question. Rather than `X=="A" & Y=="B"` the actual logic is an inequality comparison between two variables. – Alexis Dec 25 '13 at 17:40
  • @Henrik: Thank you for both examples! They're fun to work through and tweak to learn about stuff. They also get my job done pretty quickly.... could be faster, but acceptable for my current purposes. – Alexis Dec 25 '13 at 18:00

2 Answers2

2

Here's one approach using ave and transform:

transform(dat[order(dat$ID, dat$T), ],
          Z = ave(X == "B" & Y == "B", ID, FUN = function(x) {
            as.integer("is.na<-"(x, (duplicated(x) & cumsum(x)) | 
                                    c(0, diff(x)) < 0)) }))

#    ID T X Y  Z
# 1   1 1 A A  0
# 2   1 2 B A  0
# 3   1 3 B B  1
# 4   1 4 B A NA
# 5   2 1 A B  0
# 6   2 2 A A  0
# 7   2 3 B A  0
# 8   2 4 A B  0
# 9   3 1 B B  1
# 10  3 2 B B NA
# 11  3 3 B B NA
# 12  3 4 B A NA

where dat is the name of your data frame. The reordering (dat[order(dat$ID, dat$T), ]) is not necessary if the rows are already ordered along ID and T.

Sven Hohenstein
  • 80,497
  • 17
  • 145
  • 168
  • This solution works the fastest for my purposes. Speed is important when doing lots of Monte Carlo simulations as I am with these. – Alexis Dec 26 '13 at 17:33
1

Another possibillity using by

ll <- by(df, df$ID, function(x){        
  x$Z <- cumsum(cumsum(x$X == "B" & x$Y == "B"))
  x$Z[x$Z > 1] <- NA
  x
})

df2 <- do.call(rbind, ll)
df2
#      ID T X Y  Z
# 1.1   1 1 A A  0
# 1.2   1 2 B A  0
# 1.3   1 3 B B  1
# 1.4   1 4 B A NA
# 2.5   2 1 A B  0
# 2.6   2 2 A A  0
# 2.7   2 3 B A  0
# 2.8   2 4 A B  0
# 3.9   3 1 B B  1
# 3.10  3 2 B B NA
# 3.11  3 3 B B NA
# 3.12  3 4 B A NA

Same function but using ddply instead:

library(plyr)
df2 <- ddply(.data = df, .variables = .(ID), function(x){        
  x$Z <- cumsum(cumsum(x$X == "B" & x$Y == "B"))
  x$Z[x$Z > 1] <- NA
  x
})

df2
l0b0
  • 55,365
  • 30
  • 138
  • 223
Henrik
  • 65,555
  • 14
  • 143
  • 159