2

I have a dataset where I am attempting to determine the earliest diagnosis of disease, as well as the code associated with that diagnosis. This is a much shorter version of the file I am working with.

Unfortunately, the first disease code is not always the earliest diagnosis, as can be seen by ID 1005

df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                Disease_code_1 = c('I802', 'G200','I802',NA, 'H356'),
                Disease_code_2 = c('A071',NA,'G20',NA,'I802'),
                Disease_code_3 = c('H250', NA,NA,NA,NA),
                Date_of_diagnosis_1 = c('12/06/1997','13/06/1997','14/02/2003',NA,'18/03/2005'),
                Date_of_diagnosis_2 = c('12/06/1998',NA,'18/09/2001',NA,'12/07/1993'),
                Date_of_diagnosis_3 = c('17/09/2010',NA,NA,NA,NA))

    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>

I have attempted to create multiple subsets of my variables for each code and date as shown below, row bind and then only keep the earliest diagnoses, however it is quite lengthy when considering all of my other covariates and variables that I need to include.

Disease_1 <- as.data.frame((cbind(df$ID, df$Disease_code_1, df$Date_of_diagnosis_1)))
Disease_2 <- as.data.frame((cbind(df$ID, df$Disease_code_2, df$Date_of_diagnosis_2)))
Disease_3 <- as.data.frame((cbind(df$ID, df$Disease_code_3, df$Date_of_diagnosis_3)))

Disease_data <- rbind(Disease_1, Disease_2, Disease_3)
colnames(Disease_data) = c("id","Disease_code","Date_of_diagnosis")

#Edit Diseasedate to only include a participant once based on earliest diagnosis 
Disease_data <- Disease_data [order(Disease_data [,'id'],Disease_data [,'Date_of_diagnosis']),]
Disease_data <- Disease_data [!duplicated(Disease_data $id),]

This is a simplified version, but I would have over 25 iterations of the Disease_ data frames, each with approximately 100 variables per data frame which works, but is very chunky and if possible would like to make it more succinct.

I understand that editing the data to include only the earliest participant is already very succinct, but it is the set-up for this method. Is there a way to use the startswith command that may work? I have attempted but with no success.

Nic
  • 63
  • 3

2 Answers2

3

Your data is so hard to work with here because it's in wide form - convert it to long-form and it will be much easier to solve. :

Using the dplyr package in tidyverse, we take all the columns other than ID and pivot them into name-value pairs, each with an individual row. Since the example variable names have the pattern [TYPE OF VALUE]_[SEQUENTIAL NUMBER] we can use a regular expression in the names_pattern argument to split the value name into a name column which has the type of measure (here contining either "Disease_code" or "Date_of_diagnosis") and that sequential number.

Then, we pivot it wider, to make a single Disease_code and Date_of_diagnosis column, with 1 row for each ID and number combination. Finally we convert Date_of_diagnosis to a Date format (using lubridate::dmy since it's in Day/Month/Year format) so that it's ordered properly, and then filter out the rows where Disease_code is NA, meaning there was no value there:

library(tidyverse)

df2 <- df %>%
    pivot_longer(-ID,
                 names_pattern = "(.*)_([0-9]*$)",
                 names_to = c('name', 'num')) %>%
    pivot_wider() %>%
    mutate(Date_of_diagnosis = dmy(Date_of_diagnosis)) %>%
    filter(!is.na(Disease_code))

df2
# A tibble: 8 × 4
     ID num   Disease_code Date_of_diagnosis
  <dbl> <chr> <chr>        <date>           
1  1001 1     I802         1997-06-12       
2  1001 2     A071         1998-06-12       
3  1001 3     H250         2010-09-17       
4  1002 1     G200         1997-06-13       
5  1003 1     I802         2003-02-14       
6  1003 2     G20          2001-09-18       
7  1005 1     H356         2005-03-18       
8  1005 2     I802         1993-07-12       

Now that the data is properly formatted, we can just group by ID and pick the earliest date (minimum value) for each:

df2 %>%
    group_by(ID) %>%
    slice_min(Date_of_diagnosis)

     ID num   Disease_code Date_of_diagnosis
  <dbl> <chr> <chr>        <date>           
1  1001 1     I802         1997-06-12       
2  1002 1     G200         1997-06-13       
3  1003 2     G20          2001-09-18       
4  1005 2     I802         1993-07-12       
divibisan
  • 11,659
  • 11
  • 40
  • 58
2

A similar approach to @divibisan, but more concise. You can make use of the .value helper so that a single pivot_longer is enough.

library(tidyverse)

df |> 
  pivot_longer(-ID, names_pattern = "(.*)_(\\d+)", names_to = c(".value", "group")) |> 
  slice_min(dmy(Date_of_diagnosis), by = "ID", na_rm = T)

# A tibble: 4 × 4
     ID group Disease_code Date_of_diagnosis
  <dbl> <chr> <chr>        <chr>            
1  1001 1     I802         12/06/1997       
2  1002 1     G200         13/06/1997       
3  1003 2     G20          18/09/2001       
4  1005 2     I802         12/07/1993  
benson23
  • 16,369
  • 9
  • 19
  • 38