1

I have a dataset with around 2 million samples. There are 30 columns for diagnoses which are coded, each labeled I10_DX1 to I10_DX30. I need to create a new variable from these existing columns.

For example, data= CHD_2016, new variable I would like to generate is =tetralogy. If any of the codes in columns I10_DX1 to I10_DX30 have Q0[01234567], it will be coded 1 otherwise 0. I am able to make it work using the code below. But, I think there has to be a shorter and better way to do it. In stata, I am able to do this in few lines using loop. Can someone suggest a better way please!

CHD_2016$tetralogy<- ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX1), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX2), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX3), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX4), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX5), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX6), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX7), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX8), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX9), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX10), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX11), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX12), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX13), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX14), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX15), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX16), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX17), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX18), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX19), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX20), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX21), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX22), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX23), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX24), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX25), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX26), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX27), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX28), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX29), 1, 0)|
          ifelse(grepl("Q0[01234567]", CHD_2016$I10_DX30), 1, 0) ```
Laxmi
  • 21
  • 4
  • 1
    Can you provide some example data using `dput(head(df))`?. – rpolicastro Aug 01 '20 at 11:17
  • Hi- this is the data like: I10_DX1 = structure(c(4970L, 8006L, 16246L, 16247L, 16246L, 7696L), .Label = c("", "A000", "A0100", "A0109", "A011", "A020", "A021", "A0220", "A0221", "A0223", "A0224", "A0225", "A0229", "A028", "A029", "A030", "A031", "A033", "A038", "A039", "A040", "A041", "A042", "A043", "A044", "A045", "A046", "A047", "A048", "A049", "A050", "A051", "A058", "A059", "A060", – Laxmi Aug 01 '20 at 11:22
  • 1
    The parenthesis doesn't close. Can you post the data example in the question instead? – Rui Barradas Aug 01 '20 at 11:29

2 Answers2

4

You can try using lapply with Reduce for selected columns

cols <- grep('I10_DX', names(CHD_2016))
CHD_2016$tetralogy <- as.integer(Reduce(`|`, lapply(CHD_2016[cols], 
                                 grepl, pattern = "Q0[01234567]")))

Also with sapply and rowSums :

CHD_2016$tetralogy <- as.integer(rowSums(sapply(CHD_2016[cols], 
                                 grepl, pattern = "Q0[01234567]")) > 0)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

Assuming your data variable codes are stored as strings, you can use the %in% operator with a for loop. Over several million data points though I think solutions involving lapply rather than a for loop will be much faster.

# Example data
ID10_DX1 <- sample(LETTERS)[1:20]
ID10_DX2 <- sample(LETTERS)[1:20]
ID10_DX3 <- sample(LETTERS)[1:20]
ID10_DX4 <- sample(LETTERS)[1:20]
ID10_DX5 <- sample(LETTERS)[1:20]
ID10_DX6 <- sample(LETTERS)[1:20]
ID10_DX7 <- sample(LETTERS)[1:20]
ID10_DX8 <- sample(LETTERS)[1:20]
ID10_DX9 <- sample(LETTERS)[1:20]
ID10_DX10 <- sample(LETTERS)[1:20]
CHD_2016 <- data.frame(ID10_DX1,ID10_DX2,ID10_DX3,ID10_DX4,ID10_DX5,ID10_DX6,ID10_DX7,ID10_DX8,ID10_DX9,ID10_DX10)

# Makes character variables for example data. 
# Not part of the solution and don't use on your own data.
CHD_2016[] <- lapply(CHD_2016, function(x) as.character(x)) 

# This following solution should work to find string "X" in any column
# and report in tetralogy column
string <- "X"
CHD_2016$tetralogy <- c()
for (i in 1:nrow(CHD_2016)) { 
  CHD_2016$tetralogy[i] <- string %in% CHD_2016[i,]
}
Roasty247
  • 679
  • 5
  • 20