
This topic has probably been brought up before, and I guess the solution is quite simple, but I have not managed to work it out so far. Let's say I have a data.frame (called "data") that contains 10 individuals (id) on which I collected observations at 3 time points (T):

data <- data.frame(id = rep(1:10, 3),
                   T  = gl(3, 10),
                   X  = sample(1:30),
                   Y  = sample(c("yes", "no"), 30, replace = TRUE),
                   Z  = sample(1:40, 30),
                   Z2 = rnorm(30, mean = 5, sd = 0.5))

    > head(data)
      id T  X   Y  Z       Z2
    1  1 1 10 yes 15 5.993605
    2  2 1 18  no 22 6.096566
    3  3 1  5  no 24 5.101393
    4  4 1 15 yes 18 4.944108
    5  5 1 23  no 34 4.634176
    6  6 1 13  no 27 5.576015

I would like to create a subset of this data.frame (a new data.frame called data2) by selecting only the individuals that have "yes" (variable Y) at each of the three time points (variable T), that is, Y == "yes" for T == 1 and T == 2 and T == 3.

I know that conditions can be combined with the "&" operator, and that this could be used to link the conditions for the 3 time points. My problem is how to write the condition for each time point: how do I tell R that I want the subjects for which Y == "yes" at T == "1", for example?

Thank you very much in advance to all. Have a great day,

Denis

flodel
den

1 Answer


You can do:

keep.ids <- tapply(data$Y, data$id, FUN = function(x)all(x == "yes"))
subset(data, keep.ids[factor(id)])

Or use the plyr package:

library(plyr)
ddply(data, "id", function(x) if(all(x$Y == "yes")) x else NULL)
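
To see what the `tapply` step computes, here is a minimal sketch on a small made-up data frame. The names `toy`, `keep` and `toy2` and the values in `toy` are assumptions for illustration only, not taken from the question. The lookup uses `as.character(id)` so that each row finds its flag by name; this is a variation on the `keep.ids[factor(id)]` indexing above, not a correction of it.

```r
# Hypothetical toy data: 3 individuals, each observed at 3 time points.
toy <- data.frame(id = rep(1:3, 3),
                  T  = gl(3, 3),
                  Y  = c("yes", "yes", "no",    # T = 1, ids 1, 2, 3
                         "yes", "no",  "no",    # T = 2, ids 1, 2, 3
                         "yes", "yes", "yes"),  # T = 3, ids 1, 2, 3
                  stringsAsFactors = FALSE)

# One logical per id: TRUE only if every Y value for that id is "yes".
keep.ids <- tapply(toy$Y, toy$id, FUN = function(x) all(x == "yes"))
# keep.ids: 1 = TRUE, 2 = FALSE, 3 = FALSE

# Expand the per-id flag back to one value per row, looking it up by name.
keep <- as.vector(keep.ids[as.character(toy$id)])

# Rows of the individuals that said "yes" at every time point.
toy2 <- toy[keep, ]
```

With this toy data only id 1 answers "yes" at all three time points, so `toy2` holds its three rows.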
flodel
  • Thank you, flodel. This is what I wanted to do, and it works very well (I tried your plyr suggestion). – den May 26 '13 at 17:38
  • Actually, this solution works on the example I proposed, but not on the real data frame I am working with. Is there another way to select, using a syntax like this one: `subset(data, Y=="yes" %in% T=="1" & Y=="yes" %in% T=="2" & Y=="yes" %in% T=="3")`? Thank you in advance if you can help me. Denis – den Jun 02 '13 at 16:49
  • You are not saying why it does not work with your real data. Can you provide a small data sample that exhibits the problem? Maybe edit your question, or rather start a new one since I already answered this one correctly... Un-accepting my answer was a little uncalled for in my opinion. – flodel Jun 02 '13 at 16:55
  • I am sorry, your answer worked well, and I just mentioned that. How can I accept it? I will do it. Sorry, I am not familiar with how exactly this forum works. – den Jun 02 '13 at 17:01
  • With my data, the data.frame returned by R contains rows that are not three repetitions of the same id, as should be the case (one for T=="1", one for T=="2" and one for T=="3"). Sometimes I get only one row for a given id. – den Jun 02 '13 at 17:06
  • Hard to tell without sample data, but maybe try to replace `FUN` above with `function(x)all(x == "yes") & length(x) == 3`. Or in the `plyr` case, use `function(x) if(all(x$Y == "yes") & all(x$T %in% 1:3)) x else NULL`. To accept an answer, check the big tick mark to the left. – flodel Jun 02 '13 at 17:12
  • Thank you again. I'll try that and come back to you. – den Jun 02 '13 at 17:16
  • I just tried that, but it doesn't work: now I get no rows in the data.frame. In fact, I don't understand how this ddply call is built, although it worked on the example. For instance, I don't understand why we have to mention the "id" variable: the selection does not rely on it. – den Jun 02 '13 at 17:25
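
On the last point: "yes at all three time points" is a property of an individual, not of a single row, so the rows have to be grouped by `id` before the test can be applied. That is what `ddply(data, "id", ...)` does: it splits the data frame by `id`, applies the function to each piece, and binds the results back together. The sketch below shows the per-time-point formulation the comments ask for. It is only an illustration under a stated assumption: no sample of the real data was posted, so the guess that some ids have fewer than three rows is a hypothesis, and `toy`, `ids.yes.at`, `keep` and `toy2` are made-up names.

```r
# Hypothetical toy data: id 4 was observed only once, at T = 1.
toy <- data.frame(id = c(rep(1:3, each = 3), 4L),
                  T  = factor(c(rep(1:3, times = 3), 1L)),
                  Y  = c("yes", "yes", "yes",   # id 1
                         "yes", "no",  "yes",   # id 2
                         "no",  "no",  "yes",   # id 3
                         "yes"),                # id 4
                  stringsAsFactors = FALSE)

# Testing only all(Y == "yes") per id lets id 4 through with a single row.
all.yes <- tapply(toy$Y, toy$id, function(x) all(x == "yes"))
# all.yes is TRUE for ids 1 and 4

# ids for which Y is "yes" at one given time point t
ids.yes.at <- function(t) toy$id[toy$T == t & toy$Y == "yes"]

# Keep only the ids found in the "yes" set of every time point.
keep <- Reduce(intersect, lapply(levels(toy$T), ids.yes.at))

toy2 <- toy[toy$id %in% keep, ]
```

Here `keep` contains only id 1, so `toy2` has exactly three rows, one per time point. An id with a missing time point is dropped, because it cannot appear in all three "yes" sets.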