Replace NA with conditions

Question

I am trying to carry out an analysis for which I need to have all the columns of my data with values.

I have a list of many patients. Patients are seen in 3 possible situations: emergencies, outpatient consultations and hospitalization.

Each patient can come to these services once or several times.

The data we have are:

Number
Date
diagnosis in the emergency room
diagnosis in outpatient consultations
diagnosis in hospitalization

The problem is that a patient comes to the emergency room, only the emergency diagnosis will be filled in by that date, external consultations and hospitalization will have an "NA". Just as if you come to outpatient clinics, you will have NA in the emergency room and hospitalization for that date (when you come to outpatient clinics).

 pacient <- c(10,10,10,10,10,11,11,12,12,12); pacient
date <- as.Date(c("01/01/2018","02/01/2018", "04/04/2018", "10/05/2018", "05/09/2018", "02/01/2018", "06/08/2018", "01/01/2018", "03/01/2018", "06/08/2018"), format = "%d/%m/%Y"); date 
set <- c("URG", "CEX", "CEX", "URG", "HOSP", "CEX", "URG", "CEX", "CEX", "URG")
dx_URG <- c("A", NA, NA, "B", NA, NA, "A", NA, NA, "B")
dx_CEX <- c(NA, "B", "C", NA, NA, "A", NA, "C", "B", NA)
dx_HOSP <- c(NA, NA, NA, NA, "A", NA, NA, NA,NA,NA)

DF <- data.frame(pacient, date, set, dx_URG, dx_CEX, dx_HOSP)); DF

My data:

        pacient    date      set   dx_URG dx_CEX dx_HOSP
1       10     01/01/2018    URG      A   <NA>   <NA>
2       10     02/01/2018    CEX   <NA>      B   <NA>
3       10     04/04/2018    CEX   <NA>      C   <NA>
4       10     10/05/2018    URG      B   <NA>   <NA>
5       10     05/09/2018    HOSP   <NA>   <NA>    A
6       11     02/01/2018    CEX   <NA>      A   <NA>
7       11     06/08/2018    URG      A   <NA>   <NA>
8       12     01/01/2018    CEX   <NA>      C   <NA>
9       12     03/01/2018    CEX   <NA>      B   <NA>
10      12     06/08/2018    URG      B   <NA>   <NA>

Fill the NA of a column if possible, with the patient's own values. That is, patient 10 has come to the Hospital 5 times on 5 different dates. For the first visit, it generates NA in CEX and HOSP, visit 2 in URG and HOSP ... and so on. I want to fill in the missing values for patient 10 in the dx_URG column with the most recent emergency diagnosis for that patient, that is to say that you are:

A, B, B, B, B

The first value stays the same, the second is an NA becomes B (since dated 10/05/2018 has a visit to URG that give diagnosis B) and so ... That for the diagnosis in CEX of patient 10, I filled in the NA with diagnosis C and for diagnosis in HOSP with the diagnosis A. This I have achieved, in part, with the following code:

dx_remp <- lapply(DF, function(x){
  setDT(DF)[, dx_URG:= na.aggregate(dx_URG, FUN=function(x){ tail(x[!is.na(x)],1)}), by = pacient]
  setDT(DF)[, dx_CEX:= na.aggregate(dx_CEX, FUN=function(x){ tail(x[!is.na(x)],1)}), by = pacient]
  setDT(DF)[, dx_HOSP:= na.aggregate(dx_HOSP, FUN=function(x){ tail(x[!is.na(x)],1)}), by = pacient]
  return(num_vist)})

The problem is that this code does not work when the patient does NOT have a diagnosis to "fill in": for example, I get an error to fill in dx_HOSP since neither patient 11 nor 12 have any value there.

The second thing I need is in relation to that, if the patients do not have values to fill in the column in which we are replacing the NA, to look for it in another of the columns: the priority would be CEX -> HOSP -> URG

In summary, I have to fill in the NA first, looking inside them column, if there were no values, search in dx_CEX, then dx_HOS, then dx_URG.

The desired result:

       pacient    date      set   dx_URG dx_CEX dx_HOSP
1       10     01/01/2018    URG      A      C     A
2       10     02/01/2018    CEX      B      B     A
3       10     04/04/2018    CEX      B      C     A
4       10     10/05/2018    URG      B      C     A
5       10     05/09/2018    HOSP     B      C     A
6       11     02/01/2018    CEX      A      A     A
7       11     06/08/2018    URG      A      A     A
8       12     01/01/2018    CEX      B      C     B
9       12     03/01/2018    CEX      B      B     B
10      12     06/08/2018    URG      B      B     B

For example, patient 10 in dx_cex has empty dates 1, 4 and 5; the NA of dx_cex would have to be filled with the last cex diagnosis for that patient, that is C. In patient 12 in dx_hosp it has no value in any of the citations, so it would be necessary to look for the last valid diagnosis in cex, that is to say B and fill all NA of dx_hosp with B.

Thank you

I'm curious, your `dx_remp` attempt uses `data.table` syntax but your sample data is just a `data.frame`? Are you using `data.table`? (note that if you do use `data.table`, you only need to `setDT` once. — Gregor Thomas, Mar 08 '19 at 12:20
Also, a couple references, what you need to do is combine two common procedures, (1) filling NAs based on the last observation, [as in this R-FAQ](https://stackoverflow.com/q/7735647/903061), and (2) "coalesescing" non-missing values across columns, as [in this question](https://stackoverflow.com/q/19253820/903061). — Gregor Thomas, Mar 08 '19 at 12:23

Gregor Thomas · Answer 1 · 2019-03-12T14:26:04.460

Adaptation based on new data. We create a helper function that fills NA values with the last non-NA value (if any), and use that instead of fill:

fill_with_last = function(x)  {
  if (any(!is.na(x))) x[is.na(x)] = tail(na.omit(x), 1)
  return(x)
}

DF %>% 
  # fix column classes (just in case)
  mutate_if(is.factor, as.character) %>%
  # ensure order is correct
  arrange(pacient, date) %>%
  # by pacient
  group_by(pacient) %>%
  # fill in the diagnosis variables with last observation
  mutate_at(vars(starts_with("dx")), fill_with_last) %>%
  # coalesce in priority order to fill in any blanks
  mutate(dx_URG = coalesce(dx_URG, dx_CEX, dx_HOSP),
         dx_CEX = coalesce(dx_CEX, dx_HOSP, dx_URG),
         dx_HOSP = coalesce(dx_HOSP, dx_CEX, dx_URG))

# # A tibble: 10 x 6
# # Groups:   pacient [3]
#    pacient date       set   dx_URG dx_CEX dx_HOSP
#      <dbl> <date>     <chr> <chr>  <chr>  <chr>  
#  1      10 2018-01-01 URG   A      C      A      
#  2      10 2018-01-02 CEX   B      B      A      
#  3      10 2018-04-04 CEX   B      C      A      
#  4      10 2018-05-10 URG   B      C      A      
#  5      10 2018-09-05 HOSP  B      C      A      
#  6      11 2018-01-02 CEX   A      A      A      
#  7      11 2018-08-06 URG   A      A      A      
#  8      12 2018-01-01 CEX   B      C      C      
#  9      12 2018-01-03 CEX   B      B      B      
# 10      12 2018-08-06 URG   B      B      B

Note there is a discrepancy in dx_Hosp row 8. Your desired result shows "B", but the logic you described (as implemented by both me and sindri) results in "C" because dx_CEX comes before dx_URG in the priority order.

Here is the original solution from when I thought you wanted to fill with the next diagnosis, not the last diagnosis. It uses tidyr for its fill function (fills in missing values with previous observation) and dplyr for its coalesce function (takes the first non-missing value across columns):

library(tidyr)
library(dplyr)

DF %>%
  # convert any factor columns to character for easy modification
  mutate_if(is.factor, as.character) %>%
  # make sure things are in the right order
  arrange(pacient, date) %>%
  # do subsequent operations "by pacient"
  group_by(pacient) %>% 
  # fill in missing values "up" with the most recent observation,
  # then fill "down" if there are other holes
  fill(dx_URG, dx_CEX, dx_HOSP, .direction = "up") %>%
  fill(dx_URG, dx_CEX, dx_HOSP, .direction = "down") %>%
  # "coalesce" column in the order of preference
  mutate(dx_URG = coalesce(dx_URG, dx_CEX, dx_HOSP),
         dx_CEX = coalesce(dx_CEX, dx_HOSP, dx_URG),
         dx_HOSP = coalesce(dx_HOSP, dx_CEX, dx_URG))
# # A tibble: 10 x 6
# # Groups:   pacient [3]
#    pacient date       set   dx_URG dx_CEX dx_HOSP
#      <dbl> <date>     <chr> <chr>  <chr>  <chr>  
#  1      10 2018-01-01 URG   A      B      A      
#  2      10 2018-01-02 CEX   B      B      A      
#  3      10 2018-04-04 CEX   B      C      A      
#  4      10 2018-05-10 URG   B      C      A      
#  5      10 2018-09-05 HOSP  B      C      A      
#  6      11 2018-01-02 CEX   A      A      A      
#  7      11 2018-08-06 URG   A      A      A      
#  8      12 2008-01-01 CEX   B      C      C      
#  9      12 2018-01-03 CEX   B      B      B      
# 10      12 2018-08-06 URG   B      B      B

A couple data notes. In your code, you had year 2008 in row 5, but in your table you had 2018 like all the rest. I changed 2008 to 2018 to match the table.

When you do cbind(), it converts everything to matrix, which loses your column classes. This is bad. It is much better to call data.frame() directly than to use as.data.frame(cbind()).

# change this
DF <- as.data.frame(cbind(pacient, date, set, dx_URG, dx_CEX, dx_HOSP))
# to this
DF <- data.frame(pacient, date, set, dx_URG, dx_CEX, dx_HOSP)

The code works perfect with the values that are in the DF, but when adding more patients, I have problems with the replacement. When replacing with values of "other diagnostic columns" because in his he has neither previous nor later diagnosis (what is achieved with fill up and down), he gives me errors and copies all the diagnoses by date, that is, if I have to replace in cex and hosp because I only have a diagnosis in urg, and in urg I have a diagnosis of ABCDE, instead of copying in cex and hosp EEEEE, copy ABCDE. — Anne, Mar 11 '19 at 08:58
I have trouble understanding the issue from your comment. Please update the data in your question to show the problem, and also please include the desired result. — Gregor Thomas, Mar 11 '19 at 14:10
@sindri_baldur yes, thank you. As you can see in the comments above the answer worked for what OP originally posted. They said it didn't work in the full case. I asked them to update the question. They have now updated the question with new data, but I have not updated the answer yet. — Gregor Thomas, Mar 12 '19 at 13:18

score 0 · Accepted Answer · answered Mar 12 '19 at 13:22

Thought this was a a fun problem and came up with this data.table solution that relies on dplyr::case_when() for readability:

# Setup ----
# Packages
library(dplyr)
library(data.table)

# Convert to data.table
setDT(DF)
# Get the order of in terms of pacient and date
setkey(DF, pacient, date)
# Convert factors to characters (easier to work with)
factors <- names(DF)[sapply(DF, is.factor)]
DF[,(factors):= lapply(.SD, as.character), .SDcols = factors]
# Define helper function
ext_most_recent <- function(x) tail(x[!is.na(x)], 1) 
# Which columns to update
cols2fill <- c("dx_URG", "dx_CEX", "dx_HOSP")

# Update columns ----
DF[, (cols2fill) := lapply(
                      .SD, 
                      function(x) {
                        most_recent <- ext_most_recent(x)[1]
                        case_when(
                          !is.na(x) ~ x,
                          !is.na(most_recent) ~ most_recent,
                          !is.na(dx_CEX) ~ dx_CEX,
                          !is.na(dx_HOSP) ~ dx_HOSP,
                          !is.na(dx_URG) ~ dx_URG,
                          TRUE ~ NA_character_
                        )
                      }
                    ), 
   by = pacient, 
   .SDcols = c("dx_URG", "dx_CEX", "dx_HOSP")]

#  1:      10 2018-01-01  URG      A      C       A
#  2:      10 2018-01-02  CEX      B      B       A
#  3:      10 2018-04-04  CEX      B      C       A
#  4:      10 2018-05-10  URG      B      C       A
#  5:      10 2018-09-05 HOSP      B      C       A
#  6:      11 2018-01-02  CEX      A      A       A
#  7:      11 2018-08-06  URG      A      A       A
#  8:      12 2018-01-01  CEX      B      C       C
#  9:      12 2018-01-03  CEX      B      B       B
# 10:      12 2018-08-06  URG      B      B       B

Replace NA with conditions

2 Answers2