0

I have a grouped data frame (meaning that I have a list of patients and each patient has several treatments). I now want to remove all rows, that do not meet a specific criteria:

The last column contains yes and no. I only want the rows that contains the last no and the first yes.

PATIENT.ID   Caffeinefactor    
---------------------------
21       no      
21       no      
21       no
21       yes      
21       yes      
34       no      
34       no      
34       yes      
34       yes      
16       no      
16       no      
DF = structure(list(PATIENT.ID = c(210625L, 210625L, 210625L, 210625L, 
210625L, 210625L, 210625L, 210625L, 210625L, 210625L, 210625L, 
210625L, 210625L, 210625L, 210625L, 210625L, 210625L, 220909L, 
220909L, 220909L, 220909L, 220909L, 220909L, 220909L, 220909L, 
220909L, 220909L, 221179L, 221179L, 221179L, 221179L, 221179L, 
221179L, 221179L, 221179L, 221179L, 221179L, 221179L, 221179L, 
221179L, 221179L, 301705L, 301705L, 301705L, 301705L, 301705L, 
301705L, 301705L, 301705L, 301705L, 301705L, 301705L, 301705L, 
301705L, 301705L, 301705L, 303926L, 303926L, 303926L, 303926L
), PATIENT.TREATMENT.NUMBER = c(1L, 2L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 12L, 13L, 17L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 12L, 13L, 14L, 15L, 16L, 1L, 2L, 3L, 4L), Caffeinefactor = c("no", 
"no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "yes", 
"yes", "yes", "no", "no", "yes", "yes", "yes", "yes", "yes", 
"yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no", 
"no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes", 
"yes", "no", "no", "no", "no", "no", "no", "yes", "no", "yes", 
"yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no"
)), row.names = c(NA, -60L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x7fe7f7002ee0>)

Expected Output:

PATIENT.ID   Caffeinefactor    
---------------------------    
21       no
21       yes                
34       no      
34       yes           

I started off with this code:

daf1 <- df %>% 
  group_by(PATIENT.ID) %>% 
  mutate(firstc = (Caffeinefactor == 'yes' & lag(Caffeinefactor) == 'no')) %>% 
  # Find the first row with `double_B` in each group, filter out rows after it
  filter(row_number() <= min(which(firstc == TRUE)))   

however, I cannot seem to solve the problem, to delete everything before AND after the two needed rows.

Rdog
  • 31
  • 5
  • Would it matter if the values are last or first, when you have only two columns i.e. `df %>% distinct %>% group_by(PATIENT.ID) %>% filter(n_distinct(Caffeinefactor) == 2) %>% ungroup` – akrun Mar 22 '22 at 17:19

3 Answers3

1
library(data.table)

# dummy data              
df <- data.table(patient = c(21, 21, 21, 21, 21, 34, 34, 34, 34, 16, 16)
                 , caffiene = c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "no", "no")
                 )
df[, time := 1:.N][]

    patient caffiene time
 1:      21       no    1
 2:      21       no    2
 3:      21      yes    3
 4:      21      yes    4
 5:      21      yes    5
 6:      34       no    6
 7:      34       no    7
 8:      34      yes    8
 9:      34      yes    9
10:      16       no   10
11:      16       no   11

# identify 1st yes
df[caffiene == 'yes', nth_yes := 1:.N, patient]

# fetch row index of 1st 'yes' and the 'no' immediately preceeding
x <- df[, .I[nth_yes == 1 & !is.na(nth_yes)] ]
y <- x-1

# return corresponding rows
df[c(y,x)][, nth_yes := NULL][order(patient)]

   patient caffiene time
1:      21       no    2
2:      21      yes    3
3:      34       no    7
4:      34      yes    8
Sweepy Dodo
  • 1,761
  • 9
  • 15
  • Thank you for your response! However, I get an error message when I try to create row ids: "Error in `[.data.frame`(df, , `:=`(row_id = 1:.N, max_row_id = .N), PATIENT.ID) : object 'PATIENT.ID' not found" ... am I missing a package? – Rdog Mar 22 '22 at 20:02
  • np. Make sure u have loaded package `library(data.table)` and converted your dataframe to a datatable `setDT(df)` – Sweepy Dodo Mar 22 '22 at 20:07
  • Thank you, now I got a result! But I now have a similar problem like with AdroMines answer. This code gives me the first no of all nos (so always row 1 for each Patient) and the last yes of all yeses. But what I am looking for is the last no before the first yes. Is there a way to get to these two rows with your code? – Rdog Mar 22 '22 at 21:48
  • I now see what you are looking for. Shall remove my answer as I have not yet come up with a solution – Sweepy Dodo Mar 22 '22 at 22:09
  • @Rdog updated answer. Let me know if suitable – Sweepy Dodo Mar 22 '22 at 23:03
  • thank you! it worked for most of my data, but it did not work when the first row was a yes, and when there were only nows and for some reason it gave me three rows (no, yes,no) like with AdroMines answer. I printed the first 60 rows of my data in the question so that you can better understand where I am going wrong. Do you know a solution for this? – Rdog Mar 23 '22 at 09:37
0

In case there are other columns, you can try the following:

EDITED: it should retrieve the first yes and the last no before the first yes.

Edit 2: based on the provided dataset and the new conditions:

  • take the row with the first "yes"
  • if there is a "no" before "yes", then take the last no
  • if there is not a "no" before "yes", then take the first "no" after "yes"

The code is quite hacky though.

library(dplyr)

df %>% 
    # add row id to confirm if the correct rows were selected
    mutate(row_id = row_number()) %>% 
    group_by(PATIENT.ID) %>% 
    mutate(first_yes = cumsum(Caffeinefactor == "yes"),
           
           # in case there are multiple first_yes == 1,
           # then only keep the first one as 1
           first_yes = ifelse(first_yes == lag(first_yes, default = -1), -1, first_yes),
           
           # is no
           last_no = (Caffeinefactor == "no") * row_number(),
           
           # keep number as is if no appears before yes,
           # otherwise make negative to differentiate
           last_no = ifelse(first_yes == 0, last_no, -last_no),
           
           # is there a no before yes for this patient
           no_before_yes = any(Caffeinefactor == "no" & first_yes < 1)) %>% 
    
    # create conditions as variables
    mutate(
        # first yes, simple enough
        cond1 = first_yes == 1,
        
        # if no before yes, then take the last no
        cond2 = no_before_yes & (last_no == max(last_no) & first_yes < 1), 
        
        # if no after yes, then take the first no after yes
        cond3 = !no_before_yes & (last_no == min(last_no) & first_yes >= 1 & last_no < 0)  
    ) %>% 
    filter(cond1 | cond2 | cond3) %>% 
    select(-cond1, -cond2, -cond3, -first_yes, -last_no, -no_before_yes)
#> # A tibble: 8 x 4
#> # Groups:   PATIENT.ID [5]
#>   PATIENT.ID PATIENT.TREATMENT.NUMBER Caffeinefactor row_id
#>        <int>                    <int> <chr>           <int>
#> 1     210625                        1 no                  1
#> 2     210625                        6 yes                 5
#> 3     220909                        1 yes                18
#> 4     221179                        1 no                 28
#> 5     221179                       10 yes                37
#> 6     301705                        1 no                 42
#> 7     301705                        7 yes                48
#> 8     303926                        1 no                 57

Input:

df <- data.frame(
    PATIENT.ID= c(21, 21, 21, 21, 21, 34, 34, 34, 34, 16, 16), 
    Caffeinefactor = c("no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes", "no", "no")
)
AdroMine
  • 1,427
  • 5
  • 9
  • Thank you! I tried running the code, but it does not work with my data, as it sometimes switches back (like this: no,no,no,no,yes,yes,yes,no,no,yes ...for one Patient). Is there anyway I can change the code so it only gives me the first change (in this case row 4 and 5)? – Rdog Mar 22 '22 at 20:05
  • so you don't want the first yes and last no, but instead the two rows where no/yes interchange? – AdroMine Mar 22 '22 at 20:15
  • I want the first yes and the row with the no before that, but no interchanges after that so that I end up with only 2 rows per Patient – Rdog Mar 22 '22 at 21:43
  • modified answer accordingly – AdroMine Mar 23 '22 at 05:18
  • Thank you so much for your help AdroMine! The new code works for most Patienten, however if the Patient startet with a yes, it gives me all treatments (but for this specific question I could basically remove all patients that start with a yes). And it sometimes gives me three rows per Patient (no, yes, no), but I have not figures out why. Is there a way to post more of my data, so that you can see what I mean? and thank again for the help – Rdog Mar 23 '22 at 08:15
  • If it's a small dataset, then you can call `dput` on it (add the output to your question). https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – AdroMine Mar 23 '22 at 09:03
  • I added the first part of the dataset, maybe it helps to find a solution – Rdog Mar 23 '22 at 09:50
  • I have edited the code, this takes care of the all provided requirements for the given data – AdroMine Mar 23 '22 at 16:16
0

I edited @Adromines code and now I resolved the issues with the "yes,no,yes" rows:

daf1 <- df %>% 
  setDT(df) %>% 
  # add row id to confirm if the correct rows were selected
  dplyr::mutate(row_id = row_number()) %>% 
  dplyr::group_by(PATIENT.ID) %>% 
  dplyr::mutate(first_yes = cumsum(Caffeinefactor == "yes"), 
         last_no = (Caffeinefactor == "no") * row_number(), 
         last_no = ifelse(first_yes == 0, last_no, 0)) %>% 
  dplyr:: filter((first_yes == 1 & Caffeine >0) | last_no == max(last_no)) %>%
  dplyr::select(-first_yes, -last_no)

Rdog
  • 31
  • 5