2

I'm trying to create new variables based on a procedure code variable with 2500+ values in a medical dataset to pull out the antibiotics, their dose, and route. I've been able to do this with ifelse statements, but it was time consuming, and is difficult to find and correct mistakes. Is there a simplified way to do this? Unfortunately the codes aren't organized in any logical way.

vet <-mutate(vet, ab = ifelse(ProcedureCode=="6160"|ProcedureCode=="2028"|ProcedureCode=="6121"|ProcedureCode=="6130"|ProcedureCode=="6131"|ProcedureCode=="6132"|ProcedureCode=="6133" |ProcedureCode=="6134"|ProcedureCode=="6135"|ProcedureCode=="6136"|ProcedureCode=="6090" |ProcedureCode=="6137"|ProcedureCode=="6138"|ProcedureCode=="6139" |ProcedureCode=="6140" |ProcedureCode=="6510"|ProcedureCode=="680D" |ProcedureCode=="633E"|ProcedureCode=="661J"|ProcedureCode=="627I" |ProcedureCode=="6198"|ProcedureCode=="6199"|ProcedureCode=="6200" |ProcedureCode=="6201" |ProcedureCode=="6202"|ProcedureCode=="622G" |ProcedureCode=="697C" |ProcedureCode=="698C" |ProcedureCode=="6204"|ProcedureCode=="6775"| ProcedureCode=="6229" |ProcedureCode=="6207" |ProcedureCode=="6203" |ProcedureCode=="6205" |ProcedureCode=="6206" |ProcedureCode=="6212" |ProcedureCode=="6213" |ProcedureCode=="6214" |ProcedureCode=="6215" |ProcedureCode=="6216" |ProcedureCode=="6219" |ProcedureCode=="692C" |ProcedureCode=="643C" |ProcedureCode=="601E" |ProcedureCode=="629G" |ProcedureCode=="6234" |ProcedureCode=="6235" |ProcedureCode=="6236" |ProcedureCode=="6237" |ProcedureCode=="6238" |ProcedureCode=="615J" |ProcedureCode=="6242" |ProcedureCode=="6243" |ProcedureCode=="6244" |ProcedureCode=="6245" |ProcedureCode=="1193" |ProcedureCode=="652G" |ProcedureCode=="657G" |ProcedureCode=="697B"|ProcedureCode=="6336" |ProcedureCode=="6337" |ProcedureCode=="6338" |ProcedureCode=="6152" |ProcedureCode=="603C" |ProcedureCode=="655B" |ProcedureCode=="6357" |ProcedureCode=="6358" |ProcedureCode=="6399" |ProcedureCode=="666B" |ProcedureCode=="695D" |ProcedureCode=="699C" |ProcedureCode=="6365" |ProcedureCode=="6366" |ProcedureCode=="696F" |ProcedureCode=="6497" |ProcedureCode=="6613" |ProcedureCode=="6508" |ProcedureCode=="6509" |ProcedureCode=="617I" |ProcedureCode=="6506" |ProcedureCode=="2029" |ProcedureCode=="6538" |ProcedureCode=="671J" |ProcedureCode=="633H" |ProcedureCode=="621G" |ProcedureCode=="680J" |ProcedureCode=="672G" |ProcedureCode=="673G" |ProcedureCode=="6559" |ProcedureCode=="6652" |ProcedureCode=="6593" |ProcedureCode=="651C" |ProcedureCode=="633B" |ProcedureCode=="659E" |ProcedureCode=="676D" |ProcedureCode=="678D" |ProcedureCode=="620B" |ProcedureCode=="6562" |ProcedureCode=="6564" |ProcedureCode=="6585" |ProcedureCode=="6766" |ProcedureCode=="6595" |ProcedureCode=="6607" |ProcedureCode=="6608" |ProcedureCode=="627B" |ProcedureCode=="6653" |ProcedureCode=="6654" |ProcedureCode=="6655"|ProcedureCode=="6732" |ProcedureCode=="6733" |ProcedureCode=="6734"|ProcedureCode=="6735" |ProcedureCode=="6795"|ProcedureCode=="6745" |ProcedureCode=="6746" |ProcedureCode=="6748" |ProcedureCode=="6758" |ProcedureCode=="697E" |ProcedureCode=="6761" |ProcedureCode=="6032" |ProcedureCode=="6747" |ProcedureCode=="6749" |ProcedureCode=="668A" |ProcedureCode=="648A" |ProcedureCode=="649A" |ProcedureCode=="6765" |ProcedureCode=="6768" |ProcedureCode=="6771" |ProcedureCode=="637B"|ProcedureCode=="6894", 1,0))

The problem is also that I need to create multiple groups (ex: Antibiotic [yes/no], dose, route) and I feel like there's a better way I'm missing that doesn't involve cutting and pasting the variable and quotation marks each time. Is there potentially a way to make a data frame and use ifelse to assign any codes that are also in that dataframe as a 1 and others as a 0?

Sorry if this is duplicated, I'm relatively new to R and am having trouble finding the vocabulary to search for what I need. I have looked around (like Nested ifelse statement , but haven't found quite what I need.

Cœur
  • 37,241
  • 25
  • 195
  • 267
S. C.
  • 23
  • 4
  • 6
    You could use `ifelse(ProcedureCode %in% your_list_of_numbers, 1, 0)`. This is not a general solution for replacing an `ifelse`. You can also look at `dplyr::case_when` or base R `switch`. – steveb May 04 '18 at 16:23
  • 2
    Or, same logic but avoiding `ifelse` entirely: `mutate(vet, ab = as.integer(ProcedureCode %in% your_list_of_numbers))` – Gregor Thomas May 04 '18 at 16:37
  • 1
    @s.c. Has your question now been answered? If so please "accept" the solution, if not please clarify the question https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work – rg255 May 05 '18 at 20:46

2 Answers2

1

Two alternative methods, both using merges/joins. One advantage of this approach is that it is much easier to maintain: you have well-structured and manageable tables of procedures instead of (potentially really-long) lines of code with your ifelse statement. The comments suggesting %in% also reduce this problem, though you'll deal with manageable vectors instead of mangeable frames.

Fake data:

library(dplyr)
library(tidyr)
vet <- data_frame(ProcedureCode = c('6160', '2028', '2029'))
  1. One frame per procedure type. This is manageable, but might be annoying if you have a lot of different types. Repeat the left_join for each type.

    abs <- data_frame(ab=TRUE, ProcedureCode = c('6160', '2028'))
    antis <- data_frame(antibiotic=TRUE, ProcedureCode = c('2029'))
    vet %>%
      left_join(abs, by = "ProcedureCode") %>%
      left_join(antis, by = "ProcedureCode") %>%
      mutate_at(vars(ab, antibiotic), funs(!is.na(.)))
    # # A tibble: 3 × 3
    #   ProcedureCode    ab antibiotic
    #           <chr> <lgl>      <lgl>
    # 1          6160  TRUE      FALSE
    # 2          2028  TRUE      FALSE
    # 3          2029 FALSE       TRUE
    

    The use of ab=TRUE (etc) is so that there is a column to merge. The rows that do not match will have an NA, which mandates the need for !is.na(.) to convert T,NA,T to T,F,T.

    You could even use vectors of procedure codes instead, something like:

    vet %>%
      left_join(data_frame(ab=TRUE, ProcedureCode=vector_of_abs), by = "ProcedureCode") %>%
      ...
    

    Though that really only helps if you already have the codes as vectors, otherwise it seems to be solely whichever is easier for you to maintain.

  2. One frame with all procedures, requiring only a single frame for types and a single left_join.

    procedures <- tibble::tribble(
      ~ProcedureCode, ~procedure,
      '6160'        , 'ab',
      '2028'        , 'ab',
      '2029'        , 'antibiotic'
    )
    left_join(vet, procedures, by = "ProcedureCode")
    # # A tibble: 3 × 2
    #   ProcedureCode  procedure
    #           <chr>      <chr>
    # 1          6160         ab
    # 2          2028         ab
    # 3          2029 antibiotic
    

    You can either keep it as-is (if it makes sense to store it that way) or spread it to be like the others:

    left_join(vet, procedures, by = "ProcedureCode") %>%
      mutate(ignore=TRUE) %>%
      spread(procedure, ignore) %>%
      mutate_at(vars(ab, antibiotic), funs(!is.na(.)))
    # # A tibble: 3 × 3
    #   ProcedureCode    ab antibiotic
    #           <chr> <lgl>      <lgl>
    # 1          2028  TRUE      FALSE
    # 2          2029 FALSE       TRUE
    # 3          6160  TRUE      FALSE
    

    (Order after the join/merge is different here, but the data remains the same.)

(I used logicals, it's easy enough to convert them to 1s and 0s, perhaps mutate(ab=1L*ab) or mutate(ab=as.integer(ab)).)

r2evans
  • 141,215
  • 6
  • 77
  • 149
0

A base R approach for a simple option:

# my dummy data
df1 <- data.frame("v1" = c(LETTERS[1:10]), "v2" = rep(NA, 10))

# step 1, fill the column with 0 (the else part of your code)
df1[,'v2'] <- 0

# step 2, create a vector containing ids you want to change
change_vec <- c("A", "C", "D", "F")

# step 3, use %in% to index and replace with 1
df1[,'v2'][df1[,'v1'] %in% change_vec] <- 1

In most cases this will be adequate, but be aware of the risks of using indexing vectors that contain numeric values.

https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f

rg255
  • 4,119
  • 3
  • 22
  • 40
  • 1
    This is often fine, though I warn against using `numeric` and `%in%`, related to the [R FAQ 7.31](https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f). I'd suggest either `integer` or (as the OP had it) `character` for your `$v1`. – r2evans May 04 '18 at 17:42
  • Thanks for linking to the FAQ, I had wondered why integers are better than numerics – rg255 May 04 '18 at 18:40