Filter rows per group with a specific condition

Question

I have a grouped data frame (meaning that I have a list of patients and each patient has several treatments). I now want to remove all rows, that do not meet a specific criteria:

The last column contains yes and no. I only want the rows that contains the last no and the first yes.

PATIENT.ID   Caffeinefactor    
---------------------------
21       no      
21       no      
21       no
21       yes      
21       yes      
34       no      
34       no      
34       yes      
34       yes      
16       no      
16       no

DF = structure(list(PATIENT.ID = c(210625L, 210625L, 210625L, 210625L, 
210625L, 210625L, 210625L, 210625L, 210625L, 210625L, 210625L, 
210625L, 210625L, 210625L, 210625L, 210625L, 210625L, 220909L, 
220909L, 220909L, 220909L, 220909L, 220909L, 220909L, 220909L, 
220909L, 220909L, 221179L, 221179L, 221179L, 221179L, 221179L, 
221179L, 221179L, 221179L, 221179L, 221179L, 221179L, 221179L, 
221179L, 221179L, 301705L, 301705L, 301705L, 301705L, 301705L, 
301705L, 301705L, 301705L, 301705L, 301705L, 301705L, 301705L, 
301705L, 301705L, 301705L, 303926L, 303926L, 303926L, 303926L
), PATIENT.TREATMENT.NUMBER = c(1L, 2L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 12L, 13L, 17L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 12L, 13L, 14L, 15L, 16L, 1L, 2L, 3L, 4L), Caffeinefactor = c("no", 
"no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "yes", 
"yes", "yes", "no", "no", "yes", "yes", "yes", "yes", "yes", 
"yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no", 
"no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes", 
"yes", "no", "no", "no", "no", "no", "no", "yes", "no", "yes", 
"yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no"
)), row.names = c(NA, -60L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x7fe7f7002ee0>)

Expected Output:

PATIENT.ID   Caffeinefactor    
---------------------------    
21       no
21       yes                
34       no      
34       yes

I started off with this code:

daf1 <- df %>% 
  group_by(PATIENT.ID) %>% 
  mutate(firstc = (Caffeinefactor == 'yes' & lag(Caffeinefactor) == 'no')) %>% 
  # Find the first row with `double_B` in each group, filter out rows after it
  filter(row_number() <= min(which(firstc == TRUE)))

however, I cannot seem to solve the problem, to delete everything before AND after the two needed rows.

Would it matter if the values are last or first, when you have only two columns i.e. `df %>% distinct %>% group_by(PATIENT.ID) %>% filter(n_distinct(Caffeinefactor) == 2) %>% ungroup` — akrun, Mar 22 '22 at 17:19

Sweepy Dodo · Answer 1 · 2022-03-22T23:07:01.337

1

library(data.table)

# dummy data              
df <- data.table(patient = c(21, 21, 21, 21, 21, 34, 34, 34, 34, 16, 16)
                 , caffiene = c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "no", "no")
                 )
df[, time := 1:.N][]

    patient caffiene time
 1:      21       no    1
 2:      21       no    2
 3:      21      yes    3
 4:      21      yes    4
 5:      21      yes    5
 6:      34       no    6
 7:      34       no    7
 8:      34      yes    8
 9:      34      yes    9
10:      16       no   10
11:      16       no   11

# identify 1st yes
df[caffiene == 'yes', nth_yes := 1:.N, patient]

# fetch row index of 1st 'yes' and the 'no' immediately preceeding
x <- df[, .I[nth_yes == 1 & !is.na(nth_yes)] ]
y <- x-1

# return corresponding rows
df[c(y,x)][, nth_yes := NULL][order(patient)]

   patient caffiene time
1:      21       no    2
2:      21      yes    3
3:      34       no    7
4:      34      yes    8

edited Mar 22 '22 at 23:07

answered Mar 22 '22 at 17:29

Sweepy Dodo

1,761
9
15

Thank you for your response! However, I get an error message when I try to create row ids: "Error in `[.data.frame`(df, , `:=`(row_id = 1:.N, max_row_id = .N), PATIENT.ID) : object 'PATIENT.ID' not found" ... am I missing a package? – Rdog Mar 22 '22 at 20:02
np. Make sure u have loaded package `library(data.table)` and converted your dataframe to a datatable `setDT(df)` – Sweepy Dodo Mar 22 '22 at 20:07
Thank you, now I got a result! But I now have a similar problem like with AdroMines answer. This code gives me the first no of all nos (so always row 1 for each Patient) and the last yes of all yeses. But what I am looking for is the last no before the first yes. Is there a way to get to these two rows with your code? – Rdog Mar 22 '22 at 21:48
I now see what you are looking for. Shall remove my answer as I have not yet come up with a solution – Sweepy Dodo Mar 22 '22 at 22:09
@Rdog updated answer. Let me know if suitable – Sweepy Dodo Mar 22 '22 at 23:03
thank you! it worked for most of my data, but it did not work when the first row was a yes, and when there were only nows and for some reason it gave me three rows (no, yes,no) like with AdroMines answer. I printed the first 60 rows of my data in the question so that you can better understand where I am going wrong. Do you know a solution for this? – Rdog Mar 23 '22 at 09:37

AdroMine · Answer 2 · 2022-03-23T16:15:42.357

In case there are other columns, you can try the following:

EDITED: it should retrieve the first yes and the last no before the first yes.

Edit 2: based on the provided dataset and the new conditions:

take the row with the first "yes"
if there is a "no" before "yes", then take the last no
if there is not a "no" before "yes", then take the first "no" after "yes"

The code is quite hacky though.

library(dplyr)

df %>% 
    # add row id to confirm if the correct rows were selected
    mutate(row_id = row_number()) %>% 
    group_by(PATIENT.ID) %>% 
    mutate(first_yes = cumsum(Caffeinefactor == "yes"),
           
           # in case there are multiple first_yes == 1,
           # then only keep the first one as 1
           first_yes = ifelse(first_yes == lag(first_yes, default = -1), -1, first_yes),
           
           # is no
           last_no = (Caffeinefactor == "no") * row_number(),
           
           # keep number as is if no appears before yes,
           # otherwise make negative to differentiate
           last_no = ifelse(first_yes == 0, last_no, -last_no),
           
           # is there a no before yes for this patient
           no_before_yes = any(Caffeinefactor == "no" & first_yes < 1)) %>% 
    
    # create conditions as variables
    mutate(
        # first yes, simple enough
        cond1 = first_yes == 1,
        
        # if no before yes, then take the last no
        cond2 = no_before_yes & (last_no == max(last_no) & first_yes < 1), 
        
        # if no after yes, then take the first no after yes
        cond3 = !no_before_yes & (last_no == min(last_no) & first_yes >= 1 & last_no < 0)  
    ) %>% 
    filter(cond1 | cond2 | cond3) %>% 
    select(-cond1, -cond2, -cond3, -first_yes, -last_no, -no_before_yes)
#> # A tibble: 8 x 4
#> # Groups:   PATIENT.ID [5]
#>   PATIENT.ID PATIENT.TREATMENT.NUMBER Caffeinefactor row_id
#>        <int>                    <int> <chr>           <int>
#> 1     210625                        1 no                  1
#> 2     210625                        6 yes                 5
#> 3     220909                        1 yes                18
#> 4     221179                        1 no                 28
#> 5     221179                       10 yes                37
#> 6     301705                        1 no                 42
#> 7     301705                        7 yes                48
#> 8     303926                        1 no                 57

Input:

df <- data.frame(
    PATIENT.ID= c(21, 21, 21, 21, 21, 34, 34, 34, 34, 16, 16), 
    Caffeinefactor = c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "no", "no")
)

Thank you! I tried running the code, but it does not work with my data, as it sometimes switches back (like this: no,no,no,no,yes,yes,yes,no,no,yes ...for one Patient). Is there anyway I can change the code so it only gives me the first change (in this case row 4 and 5)? — Rdog, Mar 22 '22 at 20:05
so you don't want the first yes and last no, but instead the two rows where no/yes interchange? — AdroMine, Mar 22 '22 at 20:15
I want the first yes and the row with the no before that, but no interchanges after that so that I end up with only 2 rows per Patient — Rdog, Mar 22 '22 at 21:43
Thank you so much for your help AdroMine! The new code works for most Patienten, however if the Patient startet with a yes, it gives me all treatments (but for this specific question I could basically remove all patients that start with a yes). And it sometimes gives me three rows per Patient (no, yes, no), but I have not figures out why. Is there a way to post more of my data, so that you can see what I mean? and thank again for the help — Rdog, Mar 23 '22 at 08:15
If it's a small dataset, then you can call `dput` on it (add the output to your question). https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — AdroMine, Mar 23 '22 at 09:03
I added the first part of the dataset, maybe it helps to find a solution — Rdog, Mar 23 '22 at 09:50
I have edited the code, this takes care of the all provided requirements for the given data — AdroMine, Mar 23 '22 at 16:16

score 0 · Accepted Answer · answered Mar 23 '22 at 13:01

I edited @Adromines code and now I resolved the issues with the "yes,no,yes" rows:

daf1 <- df %>% 
  setDT(df) %>% 
  # add row id to confirm if the correct rows were selected
  dplyr::mutate(row_id = row_number()) %>% 
  dplyr::group_by(PATIENT.ID) %>% 
  dplyr::mutate(first_yes = cumsum(Caffeinefactor == "yes"), 
         last_no = (Caffeinefactor == "no") * row_number(), 
         last_no = ifelse(first_yes == 0, last_no, 0)) %>% 
  dplyr:: filter((first_yes == 1 & Caffeine >0) | last_no == max(last_no)) %>%
  dplyr::select(-first_yes, -last_no)

Filter rows per group with a specific condition

3 Answers3