I have a dataset where I am attempting to determine the earliest diagnosis of disease, as well as the code associated with that diagnosis. This is a much shorter version of the file I am working with.
Unfortunately, the first disease code is not always the earliest diagnosis, as can be seen for IDs 1003 and 1005:
df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
                 Disease_code_1 = c('I802', 'G200', 'I802', NA, 'H356'),
                 Disease_code_2 = c('A071', NA, 'G20', NA, 'I802'),
                 Disease_code_3 = c('H250', NA, NA, NA, NA),
                 Date_of_diagnosis_1 = c('12/06/1997', '13/06/1997', '14/02/2003', NA, '18/03/2005'),
                 Date_of_diagnosis_2 = c('12/06/1998', NA, '18/09/2001', NA, '12/07/1993'),
                 Date_of_diagnosis_3 = c('17/09/2010', NA, NA, NA, NA))
    ID Disease_code_1 Disease_code_2 Disease_code_3 Date_of_diagnosis_1 Date_of_diagnosis_2 Date_of_diagnosis_3
1 1001           I802           A071           H250          12/06/1997          12/06/1998          17/09/2010
2 1002           G200           <NA>           <NA>          13/06/1997                <NA>                <NA>
3 1003           I802            G20           <NA>          14/02/2003          18/09/2001                <NA>
4 1004           <NA>           <NA>           <NA>                <NA>                <NA>                <NA>
5 1005           H356           I802           <NA>          18/03/2005          12/07/1993                <NA>
I have attempted to create a separate subset for each code and date pair, as shown below, then row-bind the subsets and keep only the earliest diagnosis per participant. However, this becomes quite lengthy once I include all of the other covariates and variables that I need.
# One subset per diagnosis slot: ID, code and date
Disease_1 <- as.data.frame(cbind(df$ID, df$Disease_code_1, df$Date_of_diagnosis_1))
Disease_2 <- as.data.frame(cbind(df$ID, df$Disease_code_2, df$Date_of_diagnosis_2))
Disease_3 <- as.data.frame(cbind(df$ID, df$Disease_code_3, df$Date_of_diagnosis_3))
Disease_data <- rbind(Disease_1, Disease_2, Disease_3)
colnames(Disease_data) <- c("id", "Disease_code", "Date_of_diagnosis")
# Parse the dd/mm/yyyy strings as dates; sorting the raw text would order by day of month, not chronologically
Disease_data$Date_of_diagnosis <- as.Date(Disease_data$Date_of_diagnosis, format = "%d/%m/%Y")
# Edit Disease_data to include each participant only once, based on earliest diagnosis
Disease_data <- Disease_data[order(Disease_data$id, Disease_data$Date_of_diagnosis), ]
Disease_data <- Disease_data[!duplicated(Disease_data$id), ]
This is a simplified version; in practice I would have over 25 of the Disease_ data frames, each with approximately 100 variables. This works, but it is very clunky, and if possible I would like to make it more succinct.
I understand that the step that keeps only the earliest diagnosis per participant is already succinct; it is the set-up for this method that is lengthy. Is there a way to use the startsWith() function that may work? I have attempted this, but with no success.
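For illustration, here is a minimal base R sketch of how startsWith() might replace the per-slot subsets. It assumes the code and date columns share the prefixes shown in the example and appear in the same slot order, so that the n-th code column pairs with the n-th date column. The names code_cols, date_cols, long and earliest are hypothetical helpers, not part of the original data.

```r
# Toy data copied from the example in the question
df <- data.frame(
  ID = c(1001, 1002, 1003, 1004, 1005),
  Disease_code_1 = c('I802', 'G200', 'I802', NA, 'H356'),
  Disease_code_2 = c('A071', NA, 'G20', NA, 'I802'),
  Disease_code_3 = c('H250', NA, NA, NA, NA),
  Date_of_diagnosis_1 = c('12/06/1997', '13/06/1997', '14/02/2003', NA, '18/03/2005'),
  Date_of_diagnosis_2 = c('12/06/1998', NA, '18/09/2001', NA, '12/07/1993'),
  Date_of_diagnosis_3 = c('17/09/2010', NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Select the two column groups by prefix instead of naming each column
code_cols <- names(df)[startsWith(names(df), "Disease_code_")]
date_cols <- names(df)[startsWith(names(df), "Date_of_diagnosis_")]

# Assumption: both groups list their slots in the same order
stopifnot(length(code_cols) == length(date_cols))

# Stack every (code, date) pair into one long data frame;
# unlist() concatenates the columns one after another, so ID is repeated once per slot
long <- data.frame(
  id = rep(df$ID, times = length(code_cols)),
  Disease_code = unlist(df[code_cols], use.names = FALSE),
  Date_of_diagnosis = as.Date(unlist(df[date_cols], use.names = FALSE),
                              format = "%d/%m/%Y"),
  stringsAsFactors = FALSE
)

# Sort by participant, then chronologically (order() places NA dates last),
# and keep the first row per participant
long <- long[order(long$id, long$Date_of_diagnosis), ]
earliest <- long[!duplicated(long$id), ]
```

Under these assumptions, adding more diagnosis slots needs no extra lines, because the prefix match picks the new columns up. Participants with no diagnosis at all (ID 1004 here) are kept with NA values, and any other covariates could be joined back onto earliest by ID with merge().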