I have a large data frame with about 150 columns.
A range of columns (DX1, DX2, ... through DX30) holds diagnosis codes. The codes look like numbers, but they are categorical variables that correspond to medical diagnoses in the ICD-9 coding system.

I have working code to search a single column, but I need to search all 30 columns to see whether any of them contains a code within the specified range (DXrange).

The core dataframe looks like:

Case  DX1   DX2   DX3   DX4   ...  DX30
1     123   345   567   99    ...  12
2     234   345   NA    NA    ...  NA
3     456   567   789   345   ...  34

Here is the working code:

## Defines a range of codes to search for    
DXrange <- factor(41000:41091, levels = levels(core$DX1)) 

## Search for the DXrange codes in column DX1.  

core$IndexEvent <- core$DX1 %in% DXrange & substr(core$DX1, 5, 5) != 2

## What is the frequency of the IndexEvent?
cat("Frequency of IndexEvent : \n"); table(core$IndexEvent)

The working code is adapted from "Calculating Nationwide Readmissions Database (NRD) Variances, Report # 2017-01".

I could run this for each DX column and then sum the results for a final IndexEvent total, but that would be repetitive and inefficient.

RROBINSON
  • You said it is working code, but nobody else can run it unless you share the `core` dataset. – www Apr 30 '18 at 13:28
  • Please learn how to make a reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – www Apr 30 '18 at 13:29
  • Thanks. Updated the question to include a subset of the core dataframe. – RROBINSON Apr 30 '18 at 13:37
  • You should add some R code that creates sample data. – Jack Wasey Aug 03 '18 at 10:43
  • I'd recommend the CRAN package that some collaborators and I have created: icd. It works with long- or wide-format data frames of medical codes, including ICD-9 and ICD-10 codes, and assigns them to groups using standard peer-reviewed or custom group definitions. It is very fast, has been used and contributed to by researchers around the world over the last five years, and is validated against the original articles. Hope you find it useful. https://cran.r-project.org/package=icd – Jack Wasey Aug 03 '18 at 10:46

2 Answers

I would first reshape the data into long format (one row per case and diagnosis column) before searching the codes, as in the following example:

set.seed(314)

df <- data.frame(id = 1:5,
                 DX1 = sample(1:10,5),
                 DX2 = sample(1:10,5),
                 DX3 = sample(1:10,5))

library(dplyr)
library(tidyr)

# Reshape to long format (one row per id and DX column), then keep the matching codes
df %>% 
  gather(key, value, -id) %>%
  filter(value %in% 1:2)

Or with just base R:

df.long <- do.call(rbind,lapply(df[,2:4],function(x) data.frame(id = df$id, DX = x)))

df.long[df.long$DX %in% 1:2, ]
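
To get back to one flag per case, the long-format idea can be combined with the question's test on the fifth character. The following is only a sketch: the `core` sample data is made up, the codes are assumed to be stored as character strings, and `dx_cols`, `long` and `hit_cases` are illustrative names.

```r
# Hypothetical sample data (not the asker's real data); codes stored as character
core <- data.frame(
  Case = 1:3,
  DX1  = c("41001", "4280",  "41002"),
  DX2  = c("25000", "41071", NA),
  DX3  = c(NA,      "4019",  "4019"),
  stringsAsFactors = FALSE
)

# Codes to search for, as character so they compare cleanly with the columns
DXrange <- as.character(41000:41091)

# All diagnosis columns: DX1, DX2, ...
dx_cols <- grep("^DX[0-9]+$", names(core), value = TRUE)

# Long format: one row per case and diagnosis column.
# unlist() stacks the columns one after another, so Case is repeated per column.
long <- data.frame(
  Case = rep(core$Case, times = length(dx_cols)),
  DX   = unlist(core[dx_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Same condition as in the question: code in range and fifth character not "2"
long$hit <- long$DX %in% DXrange & substr(long$DX, 5, 5) != "2"

# A case is an index event if any of its diagnosis rows is a hit
hit_cases <- unique(long$Case[long$hit])
core$IndexEvent <- core$Case %in% hit_cases

table(core$IndexEvent)
```

Missing codes (NA) never match, because `NA %in% DXrange` is FALSE.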
Wietze314

We could use filter_at with any_vars to keep the rows where any of the DX columns contains a code in DXrange:

df %>% 
  filter_at(vars(matches("DX\\d+")), any_vars(. %in% DXrange))

where

DXrange <- 41000:41091
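
For reference, a base R sketch of the same row-wise "any column matches" test, this time creating the IndexEvent column from the question instead of filtering rows. The sample `core` data and the helper `is_index` are hypothetical, and the codes are assumed to be stored as character strings or factors.

```r
# Hypothetical sample data (not the asker's real data)
core <- data.frame(
  Case = 1:3,
  DX1  = c("41001", "4280",  "41002"),
  DX2  = c("25000", "41071", NA),
  DX3  = c(NA,      "4019",  "4019"),
  stringsAsFactors = FALSE
)

DXrange <- as.character(41000:41091)

# Select every column named DX followed by digits
dx_cols <- grep("^DX[0-9]+$", names(core), value = TRUE)

# Hypothetical helper: the question's single-column test, applied to one column
is_index <- function(x) {
  x <- as.character(x)  # also handles factor columns
  x %in% DXrange & substr(x, 5, 5) != "2"
}

# One logical vector per DX column, combined element-wise with OR
per_column <- lapply(core[dx_cols], is_index)
core$IndexEvent <- Reduce(`|`, per_column)

## What is the frequency of the IndexEvent?
table(core$IndexEvent)
```

Because the columns are selected by name pattern, the same code covers DX1 through DX30 without listing them.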
akrun