R: Search multiple columns for factors

Question

I have a large dataframe with multiple columns (about 150).
There is a range of columns (DX1, DX2, ..., DX30) that hold diagnosis codes (the codes are numbers, but they are categorical variables that correspond to a medical diagnosis using the ICD-9 coding system).

I have working code to search a single column, but need to search all 30 columns to see if any of the columns contain a code within the specified range (DXrange).

The core dataframe looks like:

Case  DX1   DX2   DX3  DX4...DX30
1     123   345   567  99    12
2     234   345   NA   NA    NA
3     456   567   789  345   34

Here is the working code:

## Defines a range of codes to search for    
DXrange <- factor(41000:41091, levels = levels(core$DX1)) 

## Search for the DXrange codes in column DX1.  

core$IndexEvent <- core$DX1 %in% DXrange & substr(core$DX1, 5, 5) != 2

## What is the frequency of the IndexEvent?
cat("Frequency of IndexEvent : \n"); table(core$IndexEvent)

The working code is adapted from "Calculating Nationwide Readmissions Database (NRD) Variances, Report # 2017-01"

I could run this for each DX column and then sum them for a final IndexEvent total, but this is not very efficient.

You said it is a working code, but it is not working unless you share the `core` dataset. — www, Apr 30 '18 at 13:28
Please learn how to make a reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — www, Apr 30 '18 at 13:29
Thanks. Updated the question to include a subset of the core dataframe. — RROBINSON, Apr 30 '18 at 13:37
I'd recommend the CRAN package that some collaborators and I have created: icd. It can work with long or wide format data frames of medical codes, including ICD-9 and ICD-10 codes, and assign them to groups using standard peer-reviewed or custom group definitions. It is very fast, has been used and contributed to by researchers around the world over the last five years, and is validated back to the original articles. Hope you find it useful. https://cran.r-project.org/package=icd — Jack Wasey, Aug 03 '18 at 10:46

score 2 · Accepted Answer · answered Apr 30 '18 at 13:39

I would first normalize the data (reshape it to long format, one row per id and diagnosis column) before searching the codes, as in the following example:

set.seed(314)

df <- data.frame(id = 1:5,
                 DX1 = sample(1:10,5),
                 DX2 = sample(1:10,5),
                 DX3 = sample(1:10,5))

library(dplyr)
library(tidyr)

# Reshape to long format (one row per id and DX column), then keep the matching codes
df %>% 
  gather(key, value, -id) %>%
  filter(value %in% 1:2)

or with just base R

df.long <- do.call(rbind,lapply(df[,2:4],function(x) data.frame(id = df$id, DX = x)))

df.long[df.long$DX %in% 1:2, ]
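
To get from the long table back to one flag per case (what the question calls IndexEvent), the matching ids can be collected and looked up again in the wide table. This is a minimal base R sketch, assuming a toy `core` data frame modeled on the question's excerpt; the `target` codes are hypothetical and chosen only for illustration:

```r
# Toy data modeled on the question's excerpt (values are illustrative only)
core <- data.frame(Case = 1:3,
                   DX1 = c(123, 234, 456),
                   DX2 = c(345, 345, 567),
                   DX3 = c(567, NA, 789),
                   DX4 = c(99, NA, 345))

# Hypothetical codes to search for
target <- c(99, 789)

# Pick up every diagnosis column by name instead of by position
dx_cols <- grep("^DX[0-9]+$", names(core), value = TRUE)

# Long format: one row per case and diagnosis column
core_long <- do.call(rbind, lapply(core[dx_cols], function(x) {
  data.frame(Case = core$Case, DX = x)
}))

# Cases with at least one matching code (NA never matches)
hits <- unique(core_long$Case[core_long$DX %in% target])

# One logical flag per case in the original wide table
core$IndexEvent <- core$Case %in% hits
table(core$IndexEvent)
```

Selecting the columns with `grep()` rather than `df[, 2:4]` means the same code should carry over to all 30 DX columns without hard-coding their positions.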

Thank you for the suggestion. I am not sure what the purpose of sample(1:10,5) is — RROBINSON, Apr 30 '18 at 17:45

score 1 · Answer 2 · answered Apr 30 '18 at 13:44

We could use filter_at with any_vars to keep the rows in which any of the DX columns contains a code in DXrange:

df %>% 
  filter_at(vars(matches("DX\\d+")), any_vars(. %in% DXrange))

where

DXrange <- 41000:41091
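
The same "any column matches" test can also be sketched in base R without reshaping, which makes it easy to carry over the fifth-digit exclusion from the question and to store the result as a column. This assumes the codes are stored as factors or character strings; the toy `core` data and the helper `is_index` are made up for illustration:

```r
# Toy data: ICD-9-like codes stored as factors (values are illustrative only)
core <- data.frame(Case = 1:3,
                   DX1 = factor(c("41001", "25000", "4019")),
                   DX2 = factor(c("4280", "41012", NA)),
                   DX3 = factor(c(NA, "4280", "41071")))

# Compare as character so factor levels do not matter
DXrange <- as.character(41000:41091)

# All diagnosis columns, selected by name
dx_cols <- grep("^DX[0-9]+$", names(core), value = TRUE)

# Hypothetical helper: TRUE where a code is in DXrange and its 5th digit is not 2
is_index <- function(x) {
  x <- as.character(x)
  x %in% DXrange & substr(x, 5, 5) != "2"
}

# Apply the test to each column, then OR the results together row by row
core$IndexEvent <- Reduce(`|`, lapply(core[dx_cols], is_index))
table(core$IndexEvent)
```

Here case 2 is not flagged because its only in-range code, 41012, ends in 2. With dplyr 1.0.4 or later, `filter(if_any(matches("DX\\d+"), ~ .x %in% DXrange))` should be the current equivalent of the `filter_at()`/`any_vars()` call, since the `_at` variants are superseded.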

akrun
