I have a question similar to this one. I want to convert several dummy/logical variables into a single categorical variable (factor) based on their names in R. My question differs because there can be many groups of variables to encode, for example age and chol_test below. This is just a subset of my data frame; additional variables such as diabetes_test would also need to be converted, so I can't just use starts_with("condition").

I want to encode low as 1, medium as 2, and high as 3. If all the dummy variables in a group are 0, the result should be NA.

list(low = 1, medium = 2, high = 3)

Basically the data looks like so:

Input

  race  gender age.low_tm1 age.medium_tm1 age.high_tm1 chol_test.low_tm1 chol_test.high_tm1
  <chr>  <int>       <int>          <int>        <int>             <int>              <int>
1 white      0           1              0            0                 0                  0
2 white      0           1              0            0                 0                  0
3 white      1           1              0            0                 0                  0
4 black      1           0              1            0                 0                  0
5 white      0           0              0            1                 0                  1
6 black      0           0              1            0                 1                  0

I want the output to look like so:

Expected Output:

  race  gender   age  chol_test
1 white      0     1        n/a  
2 white      0     1        n/a
3 white      1     1        n/a
4 black      1     2        n/a
5 white      0     3          3
6 black      0     2          1

How could I do this? I'm looking for a solution that is similar to the ones posted in the question I linked using dplyr if possible. Sorry for any redundancies.

Data

df <- structure(list(race = c("white", "white", "white", "black", "white", 
"black"), gender = c(0L, 0L, 1L, 1L, 0L, 0L), age.low_tm1 = c(1L, 
1L, 1L, 0L, 0L, 0L), age.medium_tm1 = c(0L, 0L, 0L, 1L, 0L, 1L
), age.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L), chol_test.low_tm1 = c(0L, 
0L, 0L, 0L, 0L, 1L), chol_test.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 
0L)), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6"))
Eisen

1 Answer

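This is how I would do it: reshape to long format, keep only the dummies that are switched on, recode the suffix, and reshape back to wide.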

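library(dplyr)
library(tidyr)

df %>% 
  # row id so pivot_wider() can rebuild one output row per input row
  mutate(id = row_number()) %>%
  pivot_longer(cols = -c(race, gender, id)) %>%
  # keep only the dummies that are switched on;
  # a row whose dummies are all 0 drops out entirely here
  filter(value > 0) %>%
  # "age.low_tm1" -> var = "age", range1 = "low_tm1"
  separate(name, c("var", "range1"), sep = '\\.') %>%
  mutate(
    value = case_when(
      range1 == 'low_tm1' ~ 1, 
      range1 == 'medium_tm1' ~ 2, 
      range1 == 'high_tm1' ~ 3
    )
  ) %>%
  select(-range1) %>%
  pivot_wider(names_from = var, values_from = value) %>%
  select(-id)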

  race  gender   age chol_test
  <chr>  <int> <dbl>     <dbl>
1 white      0     1        NA
2 white      0     1        NA
3 white      1     1        NA
4 black      1     2        NA
5 white      0     3         3
6 black      0     2         1
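
If some rows have every dummy equal to 0, the pipeline above drops them at the filter step. The following is a minimal sketch of one possible variant that keeps such rows and fills them with NA. It is not part of the original answer: the three-row data frame, the `codes` lookup vector, and the assumption that every dummy column is named `<var>.<low|medium|high>_tm1` are all illustrative.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: row 3 has every dummy set to 0
df <- data.frame(
  race = c("white", "black", "white"),
  gender = c(0L, 1L, 0L),
  age.low_tm1 = c(1L, 0L, 0L),
  age.medium_tm1 = c(0L, 1L, 0L),
  age.high_tm1 = c(0L, 0L, 0L),
  chol_test.low_tm1 = c(0L, 0L, 0L),
  chol_test.high_tm1 = c(0L, 1L, 0L)
)

# Lookup from level name to numeric code (mirrors the list in the question)
codes <- c(low = 1, medium = 2, high = 3)

out <- df %>%
  mutate(id = row_number()) %>%
  # Split "age.low_tm1" into var = "age" and level = "low" while pivoting
  pivot_longer(
    cols = -c(race, gender, id),
    names_to = c("var", "level"),
    names_pattern = "(.*)\\.(low|medium|high)_tm1"
  ) %>%
  # Collapse each (row, variable) group to one code instead of filtering,
  # so groups with no dummy switched on survive as NA
  group_by(race, gender, id, var) %>%
  summarise(
    code = if (any(value > 0)) max(codes[level[value > 0]]) else NA_real_,
    .groups = "drop"
  ) %>%
  pivot_wider(names_from = var, values_from = code) %>%
  arrange(id) %>%   # summarise() sorts by group keys; restore input order
  select(-id)

out
```

Taking `max()` is an arbitrary tie-break for rows where more than one dummy in a group is 1; if that should never happen in the data, it may be better to check for it explicitly.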
Quixotic22
  • This is great! I get a "Values are not uniquely identified; output will contain list-cols." warning however. Do you know why this would be the case? For chol_test I get values of c(1,3,NA) for every row with the full data set. – Eisen Jan 04 '22 at 14:52
  • Presumably in the `pivot_wider`. Does it run before that? If so are there any rows where say `chol_test.low_tm1` = 1 and `chol_test.medium_tm1` = 1? – Quixotic22 Jan 04 '22 at 14:59
  • No, but there are cases where all of the chol_test columns can be 0, so some ids would be removed by your filter(value > 0), I think. – Eisen Jan 04 '22 at 15:07
  • For example, if there is a row with age and chol_test all 0 and you remove the filter condition, the warning will appear. – Eisen Jan 04 '22 at 15:20
  • The filter needs to be included, does it run with it in? – Quixotic22 Jan 04 '22 at 15:26
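
For anyone hitting the same warning: `pivot_wider()` produces list-columns when more than one long-format row maps to the same output cell. With the filter in place, that can only happen when a row has more than one dummy equal to 1 within the same group. A rough way to look for such rows is sketched below; the two-row `toy` data frame is made up for illustration and is not the asker's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: row 1 has both chol_test dummies switched on
toy <- data.frame(
  race = c("white", "black"),
  gender = c(0L, 1L),
  chol_test.low_tm1 = c(1L, 0L),
  chol_test.high_tm1 = c(1L, 1L)
)

dupes <- toy %>%
  mutate(id = row_number()) %>%
  pivot_longer(cols = -c(race, gender, id)) %>%
  filter(value > 0) %>%
  separate(name, c("var", "range1"), sep = "\\.") %>%
  # Count switched-on dummies per (row, variable); more than one is a conflict
  count(id, var) %>%
  filter(n > 1)

dupes
```

Any row listed in `dupes` would end up as a list-column entry after `pivot_wider()`. Removing the filter has the same effect for every row, because each (id, var) pair then keeps one long-format row per level.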