0

I have a dataset that has over 10 million variables. It has 25 columns for diagnosis code. Each row represents one patient. Based on admission to the hospital Diagnosis code is generated and it can be placed anywhere between columns 1 to 25. I would like to create a new column for lets say disease A that has codes ranging from 480,481,482,483,484,V334,V555. If any of these codes are present in these 25 columns new column should return value of 1 else 0.

Original Table:

DX1 DX2 DX3 DX4. DX5. DX6. DX7. DX8. DX9. DX10. DX11.
481 482
V334
484
485.
V555 481
F666
G765 481

I want new column with 0 or 1 variable even if one of the row has values: 480,481,482,483,484,V334,V555. For e.g.

DX1 DX2 DX3 DX4. DX5. DX6. DX7. DX8. DX9. DX10. DX11. NewCol
481 482 1
V334 0
484 1
485. 1
V555 481 1
F666 0
G765 481 1

I have tried using:

%>%mutate(NewCol = recode(DX1,
                        "486" = "1",
                        .default = "0")) 

but this is only for column DX1 and variable 486. I want to apply for 25 consecutive columns ranging from DX1:DX25 and include all variables 480,481,482,483,484,V334,V555. Thank You!

Dev
  • 1
  • 1
  • Please provide a reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – william3031 Dec 15 '20 at 23:51
  • Use `tidyr::pivot_longer()` to reshape your data set such that those 25 columns become one column, than use `dplyr::case_when()` to recode and group the conditions. – Phil Dec 16 '20 at 00:45
  • You should provide a small example of how your data looks along with expected output. See [how to give a reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Dec 16 '20 at 01:39
  • Here is the example William3031, Ronak Shah. – Dev Dec 16 '20 at 01:52
  • Instead of posting your code as an answer, please edit your question and delete the answer. – AnilGoyal Dec 16 '20 at 07:46

1 Answers1

0

In the future, please provide a reproducible example. As to your question, I'd follow @Phil's recommendation and pivot to long-form data after adding an index for patient number. Then you just need to group by row and choose the maximum (it'll be either 1 or 0), and re-merge with the original data.

library(tidyverse)

#Read In Data
df <- data.frame(
  patient = c(1:7),
  dx1 = c('481',NA,NA,NA,'V555','F666',NA),
  dx2 = c(NA,NA,NA,NA,NA,NA,NA),
  dx3 = c(NA,NA,NA,NA,NA,NA,NA),
  dx4 = c(NA,NA,NA,NA,NA,NA,NA),
  dx5 = c(NA,NA,'484',NA,NA,NA,NA),
  dx6 = c('482',NA,NA,'485',NA,NA,NA),
  dx7 = c(NA,NA,NA,NA,NA,NA,NA),
  dx8 = c(NA,NA,NA,NA,NA,NA,NA),
  dx9 = c(NA,NA,NA,NA,NA,NA,NA),
  dx10 = c(NA,'V334',NA,NA,'481',NA,'481')
)

# Declare variables to count
valList = c('481','482','483','484','V334','V555')

#Count, Group, and Rejoin
df %>%
  pivot_longer(cols = dx1:dx10,
               names_to = 'columns',
               values_to = 'values') %>%
  mutate(in_valList = ifelse(values %in% valList,1,0)) %>%
  group_by(patient) %>% 
  summarise(NewCol = max(in_valList,na.rm = T)) %>%
  left_join(df,by='patient')

The output ends up looking like this, which you could reorder however you like if you absolutely must have NewCol at the end.

# A tibble: 7 x 12
  patient NewCol dx1   dx2   dx3   dx4   dx5   dx6   dx7   dx8   dx9   dx10 
    <int>  <dbl> <chr> <lgl> <lgl> <lgl> <chr> <chr> <lgl> <lgl> <lgl> <chr>
1       1      1 481   NA    NA    NA    NA    482   NA    NA    NA    NA   
2       2      1 NA    NA    NA    NA    NA    NA    NA    NA    NA    V334 
3       3      1 NA    NA    NA    NA    484   NA    NA    NA    NA    NA   
4       4      0 NA    NA    NA    NA    NA    485   NA    NA    NA    NA   
5       5      1 V555  NA    NA    NA    NA    NA    NA    NA    NA    481  
6       6      0 F666  NA    NA    NA    NA    NA    NA    NA    NA    NA   
7       7      1 NA    NA    NA    NA    NA    NA    NA    NA    NA    481  
Gru
  • 95
  • 1
  • 7