R, find character string from vector, return true/false column and a multiple choice column

Question

This question is very similar to one I had previously asked here: R, find character string from vector, create new TRUE/FALSE columns and that solution worked perfectly, but now I have an added twist.

    df<-structure(list(Date = c("5/20/2019", "5/20/2019", "5/20/2019", 
"5/20/2019", "5/20/2019", "5/20/2019", "5/20/2019", "5/20/2019", 
"10/22/2018", "5/20/2019"), ESRD_1 = c("CKD (chronic kidney disease), stage III [N18.30]; CKD (chronic kidney disease), stage III [N18.30]; Type 2 diabetes mellitus [E11.9]", 
NA, NA, NA, NA, NA, NA, NA, NA, NA), ESRD_10 = c(NA, NA, NA, 
NA, NA, NA, NA, NA, "End stage renal disease on dialysis [N18.6, Z99.2]; End stage renal disease on dialysis [N18.6, Z99.2]; Type 2 diabetes mellitus with stage 4 chronic kidney disease, with long-term current use of insulin [E11.22, N18.4, Z79.4]; Type 2 diabetes mellitus with stage 4 chronic kidney disease, with long-term current use of insulin [E11.22, N18.4, Z79.4]; Type 2 diabetes mellitus with stage 4 chronic kidney disease, with long-term current use of insulin [E11.22, N18.4, Z79.4]", 
NA), ESRD_11 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, "ESRD (end stage renal disease) on dialysis [N18.6, Z99.2]; ESRD (end stage renal disease) on dialysis [N18.6, Z99.2]; Peripheral vascular disease due to secondary diabetes [E13.51]; Type 2 diabetes mellitus with chronic kidney disease on chronic dialysis, with long-term current use of insulin [E11.22, N18.6, Z99.2, Z79.4]; Type 2 diabetes mellitus with chronic kidney disease on chronic dialysis, with long-term current use of insulin [E11.22, N18.6, Z99.2, Z79.4]; Type 2 diabetes mellitus with chronic kidney disease on chronic dialysis, with long-term current use of insulin [E11.22, N18.6, Z99.2, Z79.4]"
), ESRD_12 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), ESRD_13 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA), ESRD_2 = c(NA, "CKD (chronic kidney disease), stage III [N18.30]; CKD (chronic kidney disease), stage III [N18.30]; Diabetic gastroparesis [E11.43, K31.84]; Type 2 diabetes mellitus [E11.9]", 
NA, NA, NA, NA, NA, NA, NA, NA), ESRD_3 = c(NA, NA, "CKD (chronic kidney disease), stage III [N18.30]; CKD (chronic kidney disease), stage III [N18.30]; Type 2 diabetes mellitus [E11.9]", 
NA, NA, NA, NA, NA, NA, NA), ESRD_4 = c(NA, NA, NA, "CKD (chronic kidney disease), stage III [N18.30]; CKD (chronic kidney disease), stage III [N18.30];", 
NA, NA, NA, NA, NA, NA), ESRD_5 = c(NA, NA, NA, NA, "CKD (chronic kidney disease), stage II [N18.20]; CKD (chronic kidney disease), stage II [N18.20]; Type 2 diabetes mellitus [E11.9]", 
NA, NA, NA, NA, NA), ESRD_6 = c(NA, NA, NA, NA, NA, "CKD (chronic kidney disease), stage III [N18.30]; CKD (chronic kidney disease), stage III [N18.30]; Type 2 diabetes mellitus [E11.9]", 
NA, NA, NA, NA), ESRD_7 = c(NA, NA, NA, NA, NA, NA, "CKD (chronic kidney disease), stage III [N18.30]; CKD (chronic kidney disease), stage III [N18.30]; Diabetic gastroparesis [E11.43, K31.84]; Type 2 diabetes mellitus [E11.9]", 
NA, NA, NA), ESRD_8 = c(NA, NA, NA, NA, NA, NA, NA, "CKD (chronic kidney disease), stage II [N18.20]; CKD (chronic kidney disease), stage II [N18.20];", 
NA, NA), ESRD_9 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), spec = structure(list(
    cols = list(Date = structure(list(), class = c("collector_character", 
    "collector")), ESRD_1 = structure(list(), class = c("collector_character", 
    "collector")), ESRD_10 = structure(list(), class = c("collector_character", 
    "collector")), ESRD_11 = structure(list(), class = c("collector_character", 
    "collector")), ESRD_12 = structure(list(), class = c("collector_logical", 
    "collector")), ESRD_13 = structure(list(), class = c("collector_logical", 
    "collector")), ESRD_2 = structure(list(), class = c("collector_character", 
    "collector")), ESRD_3 = structure(list(), class = c("collector_character", 
    "collector")), ESRD_4 = structure(list(), class = c("collector_character", 
    "collector")), ESRD_5 = structure(list(), class = c("collector_character", 
    "collector")), ESRD_6 = structure(list(), class = c("collector_character", 
    "collector")), ESRD_7 = structure(list(), class = c("collector_character", 
    "collector")), ESRD_8 = structure(list(), class = c("collector_character", 
    "collector")), ESRD_9 = structure(list(), class = c("collector_logical", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))

I want to look through those "ESRD" columns and look for specific patterns, creating TRUE/FALSE columns of whether specific patterns are found. For instance R would look through the ESRD columns for either the word "diabetes" or the code "E11.9" and it would create a true/false diabetes column based on what it found.

I could do that, the code would be very similar to what was posted in that other question.

My question now is: for one of those particular columns I am creating now, it wouldn't just be true/false. I'd like to create a column that was labeled "CKD" that has possible values of: "No" (if a code was not found), "Stage 1", "Stage 2", "stage 3", "Stage 4", "Stage 5", "End Stage Renal Disease" and "Unspecified" (a specific code for unspecified). Following these rules:

So my end result would look like this:

You could call your column CKD_Stage and scan for roman numerals 1-5, ESRD & other conditions [[regex roman numerals](https://stackoverflow.com/questions/49147890/how-to-check-if-a-string-contains-roman-numerals-in-r) . — Chris, Aug 09 '21 at 20:38
For row 9, it states `"End stage renal disease on dialysis [N18.6, Z99.2];`, not clear why you take the `Stage 4` there — akrun, Aug 09 '21 at 21:21

akrun · Accepted Answer · 2021-08-09T20:52:41.580

For the flag, we can use str_detect with if_any, then create the 'N_val' column by extracting the substring that matches 'N' followed by one or digits, (\\d+), a dot (\\.) and one or more digits, coalecse them and join on a key/val dataset to return the corresponding 'stage'

library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
keydat <- tibble(N_val = str_c('N18.', c(1:6, 9)), 
       CRD = c(str_c('Stage ', 1:5), 'ESRD', 'unspecified'))

out <- df %>% 
    mutate(Diabetes = replace_na(if_any(starts_with("ESRD"),
     ~ str_detect(., regex('diabetes|E11\\.9', ignore_case = TRUE))), FALSE), 
    N_val = invoke(coalesce, across(starts_with("ESRD"), 
               ~ str_remove(str_extract(., "N\\d+\\.\\d+"), "0+$")))) %>% 
    left_join(keydat)

-output

> out$Diabetes
 [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
> out$CRD
 [1] "Stage 3" "Stage 3" "Stage 3" "Stage 3" "Stage 2" "Stage 3" "Stage 3" "Stage 2" "ESRD"    "ESRD"

I think the OP wanted to populate a further column (CKD), essentially summarizing the 'how bad is it?', derived from what was in the strings. — Chris, Aug 09 '21 at 20:48
@Chris Yes, I updated that, but the values in his data are `N18.30` etc — akrun, Aug 09 '21 at 20:48

Serkan · Answer 2 · 2021-08-09T21:24:33.587

I wanted to give it a shot aswell using unite from tidyr,

library(tidyverse)


df %>% unite(
        col = "dummy",
        na.rm = TRUE,
        -"Date"
) %>% mutate(
        diabetes = case_when(
                str_detect(dummy, pattern = "diabetes|E11.9") ~ TRUE,
                TRUE ~ FALSE
        ),
        CKD = str_extract(
                dummy, pattern = "stage I+|stage [:digit:]+|ESRD|CKD, unspecified"
        )
) %>% select(-"dummy") %>% mutate(
        dummy = str_extract(
                CKD, pattern = "I+"
        ) %>% as.roman() %>% as.numeric(),
        CKD = str_remove(
                CKD, pattern = "I+"
        )
) %>% unite(
        col = "CKD",
        na.rm = TRUE,
        CKD:dummy,sep = ""
)

# A tibble: 10 x 3
   Date       diabetes CKD    
   <chr>      <lgl>    <chr>  
 1 5/20/2019  TRUE     stage 3
 2 5/20/2019  TRUE     stage 3
 3 5/20/2019  TRUE     stage 3
 4 5/20/2019  FALSE    stage 3
 5 5/20/2019  TRUE     stage 2
 6 5/20/2019  TRUE     stage 3
 7 5/20/2019  TRUE     stage 3
 8 5/20/2019  FALSE    stage 2
 9 10/22/2018 TRUE     stage 4
10 5/20/2019  TRUE     ESRD

Update: Converted Romans into Numeric.

R, find character string from vector, return true/false column and a multiple choice column

2 Answers2