What is the best way to turn variables with either an answer or NA into dummy variables?

Question

I have example data as follows:

df <- data.frame(Q1_A = c("This is a reason", NA, "This is a reason", NA),
                 Q1_B = c("This is another reason", "This is another reason", NA, NA))

Each question had multiple answer possibilities, so the answers had to be split out into separate columns. The NAs are therefore not real missing values; they only mean that the respondent did not pick that answer.

I would like to run a regression in the form:

lm(y ~ Q1_A + Q1_B + ...)

which would then show output like:

Coefficients:
(Intercept)         Q1_A         Q1_B
   34.66099     -0.02058     -1.58728

I guess this means I need to turn all the NA values into a base level.

What is the best way to turn these variables into dummies?

Desired output:

df <- data.frame(Q1_A = c("This is a reason", "Baselevel", "This is a reason", "Baselevel"),
                 Q1_B = c("This is another reason", "This is another reason", "Baselevel", "Baselevel"))

It looks like you're mixing both long and wide formats. I would create an ID for each question and list all corresponding answers under the same ID, e.g., with a new row for each answer. — dufei, Mar 21 '23 at 10:07
@dufei Could you elaborate a bit, or provide some link? What would be the benefit of your suggestion? — Tom, Mar 21 '23 at 10:19
Try: `df[is.na(df)] <- "Baselevel"`. See also [How do I replace NA values with zeros in an R dataframe?](https://stackoverflow.com/q/8161836/10488504) or [Replace all NA with FALSE in selected columns in R](https://stackoverflow.com/q/7279089/10488504). — GKi, Mar 21 '23 at 11:27

Julian · Answer 1 · 2023-03-21T11:15:31.657


Using tidyr::replace_na:

library(dplyr)

df |>
  mutate(across(starts_with("Q"),
                ~ relevel(as.factor(tidyr::replace_na(.x, "Baselevel")), ref = "Baselevel")))

For Q1_A you get

[1] This is a reason Baselevel        This is a reason Baselevel       
Levels: Baselevel This is a reason
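
To see what this buys you in the regression, here is a minimal base R sketch of the same idea (replace NA with an explicit level, then make it the reference). The outcome `y` is made up purely for illustration and is not part of the question's data.

```r
# Example data from the question.
df <- data.frame(
  Q1_A = c("This is a reason", NA, "This is a reason", NA),
  Q1_B = c("This is another reason", "This is another reason", NA, NA)
)
y <- c(3.1, 2.4, 4.0, 1.7)  # hypothetical outcome, for illustration only

# Replace NA with an explicit level and make it the reference level.
df[] <- lapply(df, function(x) {
  x[is.na(x)] <- "Baselevel"
  relevel(factor(x), ref = "Baselevel")
})

fit <- lm(y ~ Q1_A + Q1_B, data = df)
coef_names <- names(coef(fit))
print(coef_names)
```

Because "Baselevel" is the reference, each coefficient is the contrast against "answer not given". Note that the coefficient names are the column name followed by the level text (for example `Q1_AThis is a reason`), so they will be long if the reasons are long.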

edited Mar 21 '23 at 11:15

answered Mar 21 '23 at 10:22


Thank you for your answer. Is there a way to additionally set the "Baselevel" as the baselevel for the factors? – Tom Mar 21 '23 at 10:51
I made an edit. – Julian Mar 21 '23 at 11:15

TimTeaFan · Answer 2 · 2023-03-21T10:28:12.963

When working with data like this, we usually turn the reason columns into 0/1 dummies and let the column name indicate the reason. When the reasons are rather long, we keep a lookup data.frame that maps each column name back to its full text.

library(dplyr)
library(tidyr)

df %>%
  mutate(across(Q1_A:Q1_B, ~ ifelse(!is.na(.x), 1, 0)))

#>   Q1_A Q1_B
#> 1    1    1
#> 2    0    1
#> 3    1    0
#> 4    0    0

# create lookup df and use when necessary
lookup_df <- df %>%
  summarise(across(everything(), ~ na.omit(unique(.x)))) %>% 
  pivot_longer(everything())

lookup_df
#> # A tibble: 2 × 2
#>   name  value                 
#>   <chr> <chr>                 
#> 1 Q1_A  This is a reason      
#> 2 Q1_B  This is another reason

Data from OP

df <- data.frame(Q1_A = c("This is a reason", NA, "This is a reason", NA),
                 Q1_B = c("This is another reason", "This is another reason", NA, NA))

Created on 2023-03-21 with reprex v2.0.2
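
As a rough sketch of how the 0/1 columns then feed into the regression from the question: the version below builds the dummies in base R (equivalent in result to the `mutate(across(...))` call above) and fits `lm` against a made-up outcome `y`, which is an assumption for illustration only.

```r
# Example data from the question.
df <- data.frame(
  Q1_A = c("This is a reason", NA, "This is a reason", NA),
  Q1_B = c("This is another reason", "This is another reason", NA, NA)
)
y <- c(3.1, 2.4, 4.0, 1.7)  # hypothetical outcome, for illustration only

# 1 where an answer was given, 0 where the cell is NA.
dummies <- data.frame(lapply(df, function(x) as.integer(!is.na(x))))

fit <- lm(y ~ Q1_A + Q1_B, data = dummies)
coef_names <- names(coef(fit))
print(coef_names)
```

With numeric dummies the coefficients are labelled by column name only (`Q1_A`, `Q1_B`), which matches the output layout shown in the question; the lookup data.frame can be used to translate them back to the full reasons.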
