Select multiple duplicate rows based on specific values in next column

Question

This is a follow-up question to Kikapp's answer.

I want to remove participant IDs that do not have all the time points. In other words, I want to keep only the rows for IDs that have all four time values (11, 21, 31, 41). See the sample data dropbox link

Here is my attempt based on Kikapp's answer. For some reason, it doesn't work. Let me know how to improve it.

data2 <- df[df$ID %in% names(table(df$ID))[table(df$ID) > 3],]

I get 4695 rows for time == 11, time == 21, and time == 41, but only 4693 for time == 31. I want these counts to be equal.

Try: `do.call(rbind,Filter(function(x) { length(unique(x[,2])) == 4 },split(df, df$ID)))`. — Abdou, Sep 30 '16 at 18:38
or `df %>% group_by(ID) %>% dplyr::filter(length(unique(time)) == 4) %>% data.frame()` with `dplyr`. — Abdou, Sep 30 '16 at 18:45
@Abdou - Thanks! First code did not work. Second gives same result as my data2 code. I get two less rows with `time==31`. Actually all four time-points (11,21,31,41) should have same number of IDs or rows or objects. With `data2 <- df[df$ID %in% names(table(df$ID))[table(df$ID) > 3],] ` code or yours `df %>% group_by(ID) %>% dplyr::filter(length(unique(time)) == 4) %>% data.frame()` code, I get 4695 rows or objects or IDs for 11, 21,41 while 4693 for 31 time. — Aby, Sep 30 '16 at 19:01
Both the code snippets I provided do the same exact thing, so I am not sure what you mean by "_First code did not work_". It looks like you have 2 rows in your data that have `time` values of `32`. You did not mention that there are rows with values of `32` for `time`. — Abdou, Sep 30 '16 at 19:15
First code took quite long time, so I stopped it. Good point. `which(grepl(32, df$time))` gave me two places where `time == 32` (5629, 12602). Writing mistake, thanks! `df$time[df$time == 32] <- 31` worked. How did you do it? Great work! — Aby, Sep 30 '16 at 19:29
I will write up an answer to explain how I found that there were rows with `32`. — Abdou, Sep 30 '16 at 19:33

score 1 · Accepted Answer · answered Sep 30 '16 at 19:40

You can use dplyr for this task for a much faster result:

df1 <- df %>% group_by(ID) %>% 
    dplyr::filter(length(unique(time)) == 4) %>% 
    data.frame()

However, when you count the rows for each time value, you will find that there are 32s hidden in there (2 rows in total):

df1 %>% group_by(time) %>% 
    dplyr::summarise(Counts = n()) %>% 
    data.frame()

#Output:
time Counts
 11   4695  
 21   4695  
 31   4693  
 32      2  
 41   4695

This shows that you have 2 rows with a time value of 32. As it turns out, that was a typo in the data. You can correct those rows with df$time[df$time == 32] <- 31 and run the code again.
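For anyone who wants to try the same check without the original file, here is a minimal base R sketch on a small made-up data frame. The toy data and the names required, unexpected, complete_ids, and df_complete are my own assumptions for illustration, not part of your data. Note that it tests for the four specific time values rather than for four distinct values, which is a slightly stricter filter than the dplyr code above.

```r
# Toy data (hypothetical): ID 1 is complete, ID 2 has a typo (32 instead of 31),
# and ID 3 is missing time point 41.
df <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  time = c(11, 21, 31, 41, 11, 21, 32, 41, 11, 21, 31)
)

# The time points every participant is expected to have.
required <- c(11, 21, 31, 41)

# Step 1: list any time values that fall outside the expected set.
unexpected <- setdiff(unique(df$time), required)
print(unexpected)

# Step 2: correct the typo, as described above.
df$time[df$time == 32] <- 31

# Step 3: keep only the IDs whose rows cover every required time point.
has_all <- tapply(df$time, df$ID, function(t) all(required %in% t))
complete_ids <- names(which(has_all))
df_complete <- df[df$ID %in% complete_ids, ]

# Every time point should now have the same number of rows.
print(table(df_complete$time))
```

Running the setdiff() step first makes stray values such as 32 visible before any filtering, so the counts per time point do not come out unequal later on.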

I hope this was helpful.

Thanks!
