
I am cleaning up data from a research study. Each subject has a unique study identification number (studyid) and had between 1 and 3 visits. This is what the first two columns look like:

> head(x[1:2])
# A tibble: 6 x 2
  studyid  visit_name       
  <fct>    <chr>                   
1 3383-002 screening_visit          
2 3383-002 medication_visit          
3 3383-002 follow-up_visit          
4 3383-007 screening_visit         
5 3383-008 medication_visit          
6 3383-009 medication_visit

I want to designate the screening_visit as baseline if it is present for that subject; if it is not, designate the medication_visit as baseline; and if that is also absent, designate the follow-up_visit as baseline.

I can group_by studyid and obtain groups of up to 3 rows for each subject, but I don't see a way to perform a logical query on those 3 rows simultaneously, return a value and then modify one element of a variable based on the answer.

I considered using mutate, but it seems to work on only one row at a time. I also read about map and other iterative tools but cannot see how they would apply here. Please help me solve this or point me toward reading that might help.

Alan
  • What exactly do you want to achieve? What would be the expected outcome? – deschen Dec 02 '20 at 23:26
  • I think Greg's answer below most closely matches what I was trying to achieve with his "baseline_flag". Thanks! – Alan Dec 03 '20 at 23:58

2 Answers


Is this what you're looking for?

library(tidyverse)
my_df<- data.frame(
  stringsAsFactors = FALSE,
           studyid = c("3383-002","3383-002",
                       "3383-002","3383-007","3383-008","3383-009"),
        visit_name = c("screening_visit",
                       "medication_visit","follow-up_visit","screening_visit",
                       "medication_visit","medication_visit")
)

glimpse(my_df)
#> Rows: 6
#> Columns: 2
#> $ studyid    <chr> "3383-002", "3383-002", "3383-002", "3383-007", "3383-00...
#> $ visit_name <chr> "screening_visit", "medication_visit", "follow-up_visit"...

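# Set the visit order explicitly so the ranking does not depend on row order
visit_levels <- c("screening_visit", "medication_visit", "follow-up_visit")

my_df %>% 
  mutate(visit_name = factor(visit_name, levels = visit_levels)) %>% 
  mutate(visit_name_num = as.numeric(visit_name)) %>% 
  group_by(studyid) %>% 
  arrange(visit_name_num, .by_group = TRUE) %>% 
  mutate(baseline = visit_name[1])  # first row per subject is the highest-priority visit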
#> # A tibble: 6 x 4
#> # Groups:   studyid [4]
#>   studyid  visit_name       visit_name_num baseline        
#>   <chr>    <fct>                     <dbl> <fct>           
#> 1 3383-002 screening_visit               1 screening_visit 
#> 2 3383-002 medication_visit              2 screening_visit 
#> 3 3383-002 follow-up_visit               3 screening_visit 
#> 4 3383-007 screening_visit               1 screening_visit 
#> 5 3383-008 medication_visit              2 medication_visit
#> 6 3383-009 medication_visit              2 medication_visit

Created on 2020-12-03 by the reprex package (v0.3.0)
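
If only the baseline row per subject is needed, rather than a column repeated on every row, something along these lines might work. This is a minimal sketch, assuming dplyr 1.0.0 or later (for `slice_min()`) and assuming the priority order screening, medication, follow-up described in the question; `visit_levels`, `visit_priority`, and `baseline_rows` are names made up for illustration.

```r
library(dplyr)

# Same example data as above
my_df <- data.frame(
  studyid = c("3383-002", "3383-002", "3383-002",
              "3383-007", "3383-008", "3383-009"),
  visit_name = c("screening_visit", "medication_visit", "follow-up_visit",
                 "screening_visit", "medication_visit", "medication_visit"),
  stringsAsFactors = FALSE
)

# Assumed priority order: earlier entries are preferred as baseline
visit_levels <- c("screening_visit", "medication_visit", "follow-up_visit")

baseline_rows <- my_df %>%
  # position of each visit in the priority vector (1 = most preferred)
  mutate(visit_priority = match(visit_name, visit_levels)) %>%
  group_by(studyid) %>%
  # keep the single most-preferred visit within each subject
  slice_min(visit_priority, n = 1, with_ties = FALSE) %>%
  ungroup()
```

`baseline_rows` should then hold one row per studyid. Using `match()` against an explicit vector keeps the priority independent of the order in which visits appear in the data.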

zoowalk

This might help you. I really like using the data.table library for things like this.

library(data.table)

# Example data: one row per visit
sample <- read.table(header = TRUE, text = '
row  studyid   visit_name
1    3383-002  screening_visit
2    3383-002  medication_visit
3    3383-002  follow-up_visit
4    3383-007  screening_visit
5    3383-008  medication_visit
6    3383-009  medication_visit
')

# Lookup table giving each visit type its priority (1 = preferred baseline)
rank <- read.table(header = TRUE, text = '
rank  visit_name
1     screening_visit
2     medication_visit
3     follow-up_visit
')


sample <- merge(sample, rank, by = "visit_name", all = FALSE) # merge in the hierarchy of visits
sample <- data.table(sample) # convert to a data.table
# add the new variable identifying the baseline visit (lowest rank within each subject)
sample[, BASELINE_FLAG := ifelse(rank == min(rank), 1, 0), by = .(studyid)]

print(sample[order(studyid)])

         visit_name row  studyid rank BASELINE_FLAG
1:  follow-up_visit   3 3383-002    3             0
2: medication_visit   2 3383-002    2             0
3:  screening_visit   1 3383-002    1             1
4:  screening_visit   4 3383-007    1             1
5: medication_visit   5 3383-008    2             1
6: medication_visit   6 3383-009    2             1
Greg J
  • Thanks Greg. This is exactly what I wanted. However, when I copy your code verbatim, it works fine until the last step - it does not create the new variable (BASELINE_FLAG) like it did for you. "rank" is there and is correct. – Alan Dec 03 '20 at 23:50
  • I have seen this happen where the table has the new column but the data frame displayed in the environment variables panel (top right side) doesn't show it. I assume that it looked right when you ran the "print(sample[order(studyid)])" line. I normally force a refresh by clicking the refresh button in the top corner of the environment variables panel. Once refreshed, the new column will show up in the column list. You don't need to refresh to use the column (or to print, export, or view the table with the column); the refresh just makes the column show up in the variables list. – Greg J Dec 04 '20 at 14:17
  • Yup it all works beautifully now - this seems to be the most elegant solution. Thanks! – Alan Dec 05 '20 at 15:32
  • I also realize that I can do the same thing in dplyr as follows: `sample %>% left_join(rank) %>% group_by(studyid) %>% mutate(BASELINE_FLAG = if_else(rank == min(rank), 1, 0))` Is there any advantage to using data.table other than speed (I only ask because I am not familiar with data.table and it would be a whole new syntax to learn)? – Alan Dec 05 '20 at 19:32
  • That is a good point; I am not expert enough to know for sure. I think they are different approaches that meet many of the same end goals. I have not read it carefully, but this thread shows that many people ask the same question. I do like data.table a lot, and once I got used to the syntax it felt really easy and fast, but I don't know dplyr as well. https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly – Greg J Dec 07 '20 at 15:53
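
For reference, the dplyr pipeline from the comment above can be written out as a self-contained example. This is only a sketch under a few assumptions: the tables are rebuilt with `data.frame()` instead of `read.table()`, the names `visits`, `visit_rank`, and `flagged` are hypothetical (chosen so as not to mask the base functions `sample()` and `rank()`), and the join key is spelled out with `by = "visit_name"` rather than left for `left_join()` to infer.

```r
library(dplyr)

# Example visits, one row per visit (same values as the question's data)
visits <- data.frame(
  studyid = c("3383-002", "3383-002", "3383-002",
              "3383-007", "3383-008", "3383-009"),
  visit_name = c("screening_visit", "medication_visit", "follow-up_visit",
                 "screening_visit", "medication_visit", "medication_visit"),
  stringsAsFactors = FALSE
)

# Lookup table: lower rank means more preferred as baseline
visit_rank <- data.frame(
  visit_name = c("screening_visit", "medication_visit", "follow-up_visit"),
  rank = 1:3,
  stringsAsFactors = FALSE
)

flagged <- visits %>%
  left_join(visit_rank, by = "visit_name") %>%
  group_by(studyid) %>%
  # min(rank) is evaluated over all rows of the current subject
  mutate(BASELINE_FLAG = if_else(rank == min(rank), 1, 0)) %>%
  ungroup()
```

The point relevant to the original question is that after `group_by()`, summary functions such as `min()` inside `mutate()` see every row of the group, so a row-wise comparison against a per-subject value is possible without explicit iteration.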