I have a dataset that looks something like this:

    Subject  Year   X   Y   
        A   1990    1   0   
        A   1991    1   0   
        A   1992    2   0   
        A   1993    3   1   
        A   1994    4   0   
        A   1995    4   0   
        B   1990    0   0   
        B   1991    1   0   
        B   1992    1   0   
        B   1993    2   1   
        C   1991    1   0   
        C   1992    2   0   
        C   1993    3   0   
        C   1994    3   0   
        D   1991    1   0   
        D   1992    2   0   
        D   1993    3   0   
        D   1994    4   0   
        D   1995    5   0   
        D   1996    5   1   
        D   1997    6   0   

How can I create two additional columns where

  • A1 is 1 if X increased and the maximum of X for the subject is at least 4; otherwise it is 0. I tried `data$A1 <- as.numeric(data$X > 4)`, but that is not quite what I want.
  • A2 is more complicated to explain, and I have no clue how to do it in R. It follows the same idea as A1, in that it should still capture all X values greater than 3, but it should equal 1 only when Y = 0 for the following 5 years. The example below shows what A2 should look like. Is it possible to do this in R, or do I need to do it manually?

Result:

            Subject  Year   X   A1   Y   A2
                A   1990    1    1   0    0
                A   1991    1    0   0    0
                A   1992    2    1   0    0
                A   1993    3    1   1    0
                A   1994    4    1   0    0
                A   1995    4    0   0    0
                B   1990    0    0   0    0
                B   1991    1    0   0    0
                B   1992    1    0   0    0 
                B   1993    2    0   1    0
                C   1991    1    0   0    0
                C   1992    2    0   0    0 
                C   1993    3    0   0    0 
                C   1994    3    0   0    0
                D   1991    1    1   0    1
                D   1992    2    1   0    1
                D   1993    3    1   0    1
                D   1994    4    1   0    1 
                D   1995    5    1   0    1 
                D   1996    5    0   1    0
                D   1997    6    1   0    0

Rawdata without the variables A1 and A2:

> dput(data)
structure(list(Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A", 
"B", "C", "D"), class = "factor"), Year = c(1990L, 1991L, 1992L, 
1993L, 1994L, 1995L, 1990L, 1991L, 1992L, 1993L, 1991L, 1992L, 
1993L, 1994L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L
), X = c(1L, 1L, 2L, 3L, 4L, 4L, 0L, 1L, 1L, 2L, 1L, 2L, 3L, 
3L, 1L, 2L, 3L, 4L, 5L, 5L, 6L), Y = c(0L, 0L, 0L, 1L, 0L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L)), .Names = c("Subject", 
"Year", "X", "Y"), class = "data.frame", row.names = c(NA, -21L
))
FKG
  • You forgot to include the code you wrote that didn't work when you tried to solve these yourself. – Rich Scriven Jun 27 '16 at 20:28
  • @RichardScriven I'll do that soon, too – FKG Jun 27 '16 at 20:30
  • I don't understand: You already have the tables above as `data.frame`? What do you have, what do you need... Your question is quite long ;-) – Christoph Jun 28 '16 at 10:15
  • @Christoph I would like to create variables X1, X2, A1, A2, and A3..:) Those are just examples of variables I want to create, I only created the data.frame to illustrate which variable I want. I know, too long :( – FKG Jun 28 '16 at 11:24
  • Please read [(1)](http://stackoverflow.com/help/how-to-ask) how do I ask a good question, [(2)](http://stackoverflow.com/help/mcve) How to create a MCVE as well as [(3)](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#answer-5963610) how to provide a minimal reproducible example in R. Then edit and improve your question accordingly. I.e., abstract from your real problem... Then we can help you ;-) At the moment it is not clear, what you have and which part is missing. – Christoph Jun 28 '16 at 11:37
  • @Christoph Thanks–very useful! I'll do that soon. – FKG Jun 28 '16 at 14:19
  • Ok. let me know. I can have a look at it on Friday if you like... – Christoph Jun 28 '16 at 15:07
  • Some fodder for your repost: The variables you are trying to create are basically dummy variables. For the ones checking for 3 0's in a row etc, make new vectors which are just lags of the original column using the base R function lag(). Then you can just make a new column which checks the values of those vectors to create your new variable -- you could use the base R function ifelse() for this, though there are prettier solutions. For the grouping you're trying to do, see this stack overflow answer: http://stackoverflow.com/questions/26291988/r-how-to-create-a-lag-variable-for-each-by-group. – verybadatthis Jun 28 '16 at 21:33
  • The above is not a perfect way to do it, but should help you make a best effort at solving it when reposting. It's hard to give a better answer without more detail, as @Christoph mentioned. Can look again when more detail is posted. – verybadatthis Jun 28 '16 at 21:34
  • dear @Christoph and verybadatthis: I've updated the question now and limited it to two related questions. Was not easy to explain must admit. Please have a look when you have the time. Cheers – FKG Jun 30 '16 at 21:10
  • Why is `A1` zero in this line: `B, 1991, 1, 0`? – Christoph Jul 01 '16 at 06:13
  • Hi @Christoph! because the X variable never reached 4 (4 or more) for subject `B`. Basically if the X is more than 3 for any subject in the dataset, the dummy variable A1 should capture it. See for example Subject A and D. – FKG Jul 01 '16 at 08:23
  • Why is A1 0 for `B 1993 2 0`? In the previous line, X is 1. – akrun Jul 01 '16 at 11:34
  • @akrun because subject B never reached 4 (A1 should capture increases only if subjects reach 4 or more in X) – FKG Jul 01 '16 at 11:43
  • @FKG I posted a solution. Hope it helps. – akrun Jul 01 '16 at 12:54

2 Answers

We can do this with data.table

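library(data.table)

# A1: for subjects whose X ever reaches 4, the first row gets 1 and later rows get diff(X)
setDT(data)[, A1 := if(any(X >= 4)) c(1, diff(X)) else 0, by = Subject]

# A2: for subjects whose X ever reaches 3, keep only runs of Y == 0 that span at least 5 rows
data[, A2 := if(any(X >= 3)) inverse.rle(within.list(rle(Y == 0),
              values[values][lengths[values] < 5] <- 0)) else 0, by = Subject]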

data[, c("Subject", "Year", "X", "A1", "Y", "A2"), with = FALSE]
#    Subject Year X A1 Y A2
# 1:       A 1990 1  1 0  0
# 2:       A 1991 1  0 0  0
# 3:       A 1992 2  1 0  0
# 4:       A 1993 3  1 1  0
# 5:       A 1994 4  1 0  0
# 6:       A 1995 4  0 0  0
# 7:       B 1990 0  0 0  0
# 8:       B 1991 1  0 0  0
# 9:       B 1992 1  0 0  0
#10:       B 1993 2  0 1  0
#11:       C 1991 1  0 0  0
#12:       C 1992 2  0 0  0
#13:       C 1993 3  0 0  0
#14:       C 1994 3  0 0  0
#15:       D 1991 1  1 0  1
#16:       D 1992 2  1 0  1
#17:       D 1993 3  1 0  1
#18:       D 1994 4  1 0  1
#19:       D 1995 5  1 0  1
#20:       D 1996 5  0 1  0
#21:       D 1997 6  1 0  0
akrun
  • I tried it on the bigger sample and it seems to work – many thanks!!! Although, for some reason, there are a lot of negative values generated when the X=0. I just deleted them, otherwise all looks good. – FKG Jul 01 '16 at 17:01
  • @FKG Thanks for the feedback. Probably, the X=0 condition was not taken into account. – akrun Jul 02 '16 at 04:01
  • Hi again. I'm working with this line: `setDT(data)[, A1 := if(any(X >=4)) c(1, diff(X)) else 0, by = Subject]`. Is it possible to adjust this part – _"if(any(X>=4))"_ – to something that would make the A1 capture when the X has reached 3 (not more than 3. If more than 3, then =0). I did try `"if(any(X=3))"` but then it only captured 3's in the X. I need it to capture 1's and 2's as well. – FKG Jul 14 '16 at 16:36
  • Just to be clear: if the X is more than 3 for a certain subject, for example, then the A1 should not capture it at all. Please let me know if creating a new question would be a better way to address this – FKG Jul 14 '16 at 16:42
  • @FKG Can you please post it as a new question? – akrun Jul 14 '16 at 17:22
  • http://stackoverflow.com/questions/38383092/how-to-generate-a-range-variable-in-r here is the question. Any suggestions are welcome! – FKG Jul 14 '16 at 19:52
  • http://stackoverflow.com/questions/38385351/how-to-create-a-conditional-variable-in-r Have a look at this one if you have the time. I've got help with the previous one. – FKG Jul 14 '16 at 22:43

Does that do the job? Do you need Subject as a factor? The code below does not yet account for the change of subject, e.g. from C to D.

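mydata <- data  # the data.frame created from the dput() output in the question
mydata$max <- rep(FALSE, nrow(mydata))
mydata$A1 <- rep(0, nrow(mydata))
mydata$A2 <- rep(0, nrow(mydata))

# Flag every subject whose maximum X is at least 4 (the threshold stated in the question)
for (i in unique(mydata$Subject)) {
  subject_max <- max(mydata$X[mydata$Subject == i])
  if (subject_max >= 4) {
    mydata$max[mydata$Subject == i] <- TRUE
  }
}
# A1 is 1 where X increased relative to the previous row; diff() runs over the
# whole column, so the comparison still crosses subject boundaries
mydata$A1 <- ifelse(mydata$max & c(FALSE, diff(mydata$X) > 0), 1, 0)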

A2 is still unclear (See also my edit). Hopefully this helps to get the rest done.
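One possible way to handle the subject boundaries is to compute the differences within each subject using base R's `ave()`. This is a sketch under assumptions rather than a tested solution for the real data: it treats the first row of each subject as an increase (as in the expected output in the question), codes every increase as 1 regardless of its size, and uses made-up names (`df`, `reaches4`, `increased`).

```r
# Subject and X columns from the question's sample data
df <- data.frame(
  Subject = rep(c("A", "B", "C", "D"), times = c(6, 4, 4, 7)),
  X = c(1, 1, 2, 3, 4, 4,  0, 1, 1, 2,  1, 2, 3, 3,  1, 2, 3, 4, 5, 5, 6)
)

# Per subject: does X ever reach 4? (ave() recycles the single value per group)
reaches4 <- ave(df$X, df$Subject, FUN = function(x) max(x) >= 4) == 1

# Per subject: 1 for the first row (assumption), then 1 only where X is
# strictly larger than in the previous row of the same subject
increased <- ave(df$X, df$Subject,
                 FUN = function(x) c(1, as.numeric(diff(x) > 0)))

df$A1 <- as.integer(reaches4 & increased == 1)
print(df)
```

Since the comparison `diff(x) > 0` only yields 0 or 1, a decrease in X gives 0 rather than a negative value.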

Christoph
  • many thanks for this, Christoph! I'm thinking what it does to my data. In the best of circumstances, it should realize the differences between subjects (or structure). – FKG Jul 01 '16 at 11:32
  • @FKG: If it worked, it would be great to mark the question as answered;-) – Christoph Jul 21 '16 at 18:48