1

I have a dataset like this:

ICD_10 diagnosis
A00 Cholera
A01-A03 Other Intestinal infectious diseases
A15 Respiratory tuberculosis
A17-A19 Other tuberculosis

...

On row 2 and 4, there are multiple ICD-10 codes, and I want to expand them into multiple rows, like below:

ICD_10 diagnosis
A00 Cholera
A01 Other Intestinal infectious diseases
A02 Other Intestinal infectious diseases
A03 Other Intestinal infectious diseases
A15 Respiratory tuberculosis
A17 Other tuberculosis
A18 Other tuberculosis
A19 Other tuberculosis

How can I accomplish this in R using tidyverse?

Thanks for your help!

zx8754
  • 52,746
  • 12
  • 114
  • 209
Bobby
  • 27
  • 3
  • Is it always the case that the letter in a range is the same letter? For instance, one would never see `A01-B05`? – r2evans Oct 31 '22 at 21:21
  • 2
    @r2evans - if it's ICD-10 coding, I don't think you should see that sort of thing. You do however see instances like H10-H13 only having H10, H11 and H13, no H12. – thelatemail Oct 31 '22 at 21:29
  • 1
    @thelatemail, it sounds like you and the OP know what `ICD_10` means, it is not obvious what constraints or standards for it may exist. – r2evans Oct 31 '22 at 21:31
  • 1
    @thelatemail looks like there is a dedicated packages for this job. – zx8754 Oct 31 '22 at 22:23

3 Answers3

2

Using dedicated icd package:

#data
d <- structure(list(ICD_10 = c("A00", "A01-A03", "A15", "A17-A19"), diagnosis = c("Cholera", "Other Intestinal infectious diseases", "Respiratory tuberculosis", "Other tuberculosis")), class = "data.frame", row.names = c(NA, -4L))

#remotes::install_github("jackwasey/icd")
library(icd)

To avoid creating non-existent or missing out existing codes between the ranges we use expand_ranges. For example, below returns 33 codes, instead of 3 if we filled in sequentially A01, A02, A03, which is wrong.

expand_range("A01", "A03")
#  [1] "A01"   "A010"  "A0100" "A0101" "A0102" "A0103" "A0104" "A0105"
#  [9] "A0109" "A011"  "A012"  "A013"  "A014"  "A02"   "A020"  "A021" 
# [17] "A022"  "A0220" "A0221" "A0222" "A0223" "A0224" "A0225" "A0229"
# [25] "A028"  "A029"  "A03"   "A030"  "A031"  "A032"  "A033"  "A038" 
# [33] "A039"

We also use explain_code, to give description for newly created codes, example usage:

explain_code("A01")
# [1] "Typhoid and paratyphoid fevers"

Now, wrap two functions into one, to get a pretty output

# custom function using expand_range
f <- function(icd10, diagnosis){
  x <- unlist(strsplit(icd10, "-"))
  
  if(length(x) == 1){ ICD10 = x 
  } else {ICD10 = expand_range(x[1], x[2])}
  
  data.frame(
    icd10 = icd10,
    diagnosis = diagnosis, 
    icd10range = ICD10,
    desc = explain_code(ICD10))
}

And loop through the codes to expand, then rowbind:

# loop through rows, and rowbind
res <- do.call(rbind, 
               mapply(f, d$ICD_10, d$diagnosis,
                      SIMPLIFY = FALSE, USE.NAMES = FALSE))

head(res)
#     icd10                            diagnosis icd10range                                 desc
# 1     A00                              Cholera        A00                              Cholera
# 2 A01-A03 Other Intestinal infectious diseases        A01       Typhoid and paratyphoid fevers
# 3 A01-A03 Other Intestinal infectious diseases       A010                        Typhoid fever
# 4 A01-A03 Other Intestinal infectious diseases      A0100           Typhoid fever, unspecified
# 5 A01-A03 Other Intestinal infectious diseases      A0101                   Typhoid meningitis
# 6 A01-A03 Other Intestinal infectious diseases      A0102 Typhoid fever with heart involvement

As expected A01-A03 now expanded into 33 rows:

table(res$icd10)
# A00 A01-A03     A15 A17-A19 
#   1      33       1      53 
zx8754
  • 52,746
  • 12
  • 114
  • 209
1
fun <- function(vec) {
  ltr <- substring(vec, 1, 1)
  L <- lapply(strsplit(gsub("[^-0-9]", "", vec), "-"), as.integer)
  mapply(function(ltr, z) sprintf("%s%02i", ltr, if (length(z) > 1) seq(z[1], z[2]) else z),
         ltr, L)
}
quux %>%
  mutate(ICD_10 = fun(ICD_10)) %>%
  tidyr::unnest(ICD_10)
# # A tibble: 8 x 2
#   ICD_10 diagnosis                           
#   <chr>  <chr>                               
# 1 A00    Cholera                             
# 2 A01    Other Intestinal infectious diseases
# 3 A02    Other Intestinal infectious diseases
# 4 A03    Other Intestinal infectious diseases
# 5 A15    Respiratory tuberculosis            
# 6 A17    Other tuberculosis                  
# 7 A18    Other tuberculosis                  
# 8 A19    Other tuberculosis                  

Data

quux <- structure(list(ICD_10 = c("A00", "A01-A03", "A15", "A17-A19"), diagnosis = c("Cholera", "Other Intestinal infectious diseases", "Respiratory tuberculosis", "Other tuberculosis")), class = "data.frame", row.names = c(NA, -4L))
r2evans
  • 141,215
  • 6
  • 77
  • 149
0

One option:

tibble::tribble(
  ~ICD_10, ~diagnosis,
  "A00", "Cholera",
  "A01-A03", "Other Intestinal infectious diseases",
  "A15", "Respiratory tuberculosis",
  "A17-A19", "Other tuberculosis"
) |> 
  tidyr::separate_rows(ICD_10, sep = "-") |> 
  mutate(id = parse_number(ICD_10)) |> 
  group_by(diagnosis) |> 
  complete(id = min(id):max(id)) |> 
  mutate(ICD_10 = paste0("A", id))
Philippe Massicotte
  • 1,321
  • 12
  • 24
  • 1
    To make it use the letter from the code (when not necessarily A), you could replace last line with: `tidyr::fill(ICD_10) %>% mutate(ICD_10 = paste0(substr(ICD_10, 1, 1), id))` – Jon Spring Oct 31 '22 at 21:38