I am trying to carry out an analysis for which I need to have all the columns of my data with values.
I have a list of many patients. Patients are seen in 3 possible situations: emergencies, outpatient consultations and hospitalization.
Each patient can come to these services once or several times.
The data we have are:
- Number
- Date
- diagnosis in the emergency room
- diagnosis in outpatient consultations
- diagnosis in hospitalization
The problem is that a patient comes to the emergency room, only the emergency diagnosis will be filled in by that date, external consultations and hospitalization will have an "NA". Just as if you come to outpatient clinics, you will have NA in the emergency room and hospitalization for that date (when you come to outpatient clinics).
pacient <- c(10,10,10,10,10,11,11,12,12,12); pacient
date <- as.Date(c("01/01/2018","02/01/2018", "04/04/2018", "10/05/2018", "05/09/2018", "02/01/2018", "06/08/2018", "01/01/2018", "03/01/2018", "06/08/2018"), format = "%d/%m/%Y"); date
set <- c("URG", "CEX", "CEX", "URG", "HOSP", "CEX", "URG", "CEX", "CEX", "URG")
dx_URG <- c("A", NA, NA, "B", NA, NA, "A", NA, NA, "B")
dx_CEX <- c(NA, "B", "C", NA, NA, "A", NA, "C", "B", NA)
dx_HOSP <- c(NA, NA, NA, NA, "A", NA, NA, NA,NA,NA)
DF <- data.frame(pacient, date, set, dx_URG, dx_CEX, dx_HOSP)); DF
My data:
pacient date set dx_URG dx_CEX dx_HOSP
1 10 01/01/2018 URG A <NA> <NA>
2 10 02/01/2018 CEX <NA> B <NA>
3 10 04/04/2018 CEX <NA> C <NA>
4 10 10/05/2018 URG B <NA> <NA>
5 10 05/09/2018 HOSP <NA> <NA> A
6 11 02/01/2018 CEX <NA> A <NA>
7 11 06/08/2018 URG A <NA> <NA>
8 12 01/01/2018 CEX <NA> C <NA>
9 12 03/01/2018 CEX <NA> B <NA>
10 12 06/08/2018 URG B <NA> <NA>
- Fill the NA of a column if possible, with the patient's own values. That is, patient 10 has come to the Hospital 5 times on 5 different dates. For the first visit, it generates NA in CEX and HOSP, visit 2 in URG and HOSP ... and so on. I want to fill in the missing values for patient 10 in the dx_URG column with the most recent emergency diagnosis for that patient, that is to say that you are:
A, B, B, B, B
The first value stays the same, the second is an NA becomes B (since dated 10/05/2018 has a visit to URG that give diagnosis B) and so ... That for the diagnosis in CEX of patient 10, I filled in the NA with diagnosis C and for diagnosis in HOSP with the diagnosis A. This I have achieved, in part, with the following code:
dx_remp <- lapply(DF, function(x){
setDT(DF)[, dx_URG:= na.aggregate(dx_URG, FUN=function(x){ tail(x[!is.na(x)],1)}), by = pacient]
setDT(DF)[, dx_CEX:= na.aggregate(dx_CEX, FUN=function(x){ tail(x[!is.na(x)],1)}), by = pacient]
setDT(DF)[, dx_HOSP:= na.aggregate(dx_HOSP, FUN=function(x){ tail(x[!is.na(x)],1)}), by = pacient]
return(num_vist)})
The problem is that this code does not work when the patient does NOT have a diagnosis to "fill in": for example, I get an error to fill in dx_HOSP since neither patient 11 nor 12 have any value there.
The second thing I need is in relation to that, if the patients do not have values to fill in the column in which we are replacing the NA, to look for it in another of the columns: the priority would be CEX -> HOSP -> URG
In summary, I have to fill in the NA first, looking inside them column, if there were no values, search in dx_CEX, then dx_HOS, then dx_URG.
The desired result:
pacient date set dx_URG dx_CEX dx_HOSP
1 10 01/01/2018 URG A C A
2 10 02/01/2018 CEX B B A
3 10 04/04/2018 CEX B C A
4 10 10/05/2018 URG B C A
5 10 05/09/2018 HOSP B C A
6 11 02/01/2018 CEX A A A
7 11 06/08/2018 URG A A A
8 12 01/01/2018 CEX B C B
9 12 03/01/2018 CEX B B B
10 12 06/08/2018 URG B B B
For example, patient 10 in dx_cex has empty dates 1, 4 and 5; the NA of dx_cex would have to be filled with the last cex diagnosis for that patient, that is C. In patient 12 in dx_hosp it has no value in any of the citations, so it would be necessary to look for the last valid diagnosis in cex, that is to say B and fill all NA of dx_hosp with B.
Thank you