
I have example data as follows:

df <- data.frame(Q1_A = c("This is a reason", NA, "This is a reason", NA),
                 Q1_B = c("This is another reason", "This is another reason", NA, NA))

Each question had multiple answer possibilities, so the answers had to be split out into separate columns. The NAs are therefore not real NAs: they mean the respondent did not pick that answer.

I would like to run a regression in the form:

lm(y ~ Q1_A + Q1_B + ...)

Which then shows as the output:

Coefficients:
(Intercept)         Q1_A         Q1_B
   34.66099     -0.02058     -1.58728  

I guess this means I need to turn all the NA values to base levels.

What is the best way to turn these variables into dummies?

Desired output:

df <- data.frame(Q1_A = c("This is a reason", "Baselevel", "This is a reason", "Baselevel"),
                 Q1_B = c("This is another reason", "This is another reason", "Baselevel", "Baselevel"))
Tom
  • what is your expected output? – Julian Mar 21 '23 at 10:05
  • It looks like you're mixing both long and wide formats. I would create an ID for each question and list all corresponding answers under the same ID, e.g., with a new row for each answer. – dufei Mar 21 '23 at 10:07
  • @Julian I have added some more information. – Tom Mar 21 '23 at 10:18
  • @dufei Could you elaborate a bit, or provide some link? What would be the benefit of your suggestion? – Tom Mar 21 '23 at 10:19
  • Try: `df[is.na(df)] <- "Baselevel"`. See also [How do I replace NA values with zeros in an R dataframe?](https://stackoverflow.com/q/8161836/10488504) or [Replace all NA with FALSE in selected columns in R](https://stackoverflow.com/q/7279089/10488504). – GKi Mar 21 '23 at 11:27

2 Answers


Using tidyr::replace_na:

library(dplyr)

df |> mutate(across(starts_with("Q"), ~ relevel(as.factor(tidyr::replace_na(.x, "Baselevel")), ref = "Baselevel")))

For Q1_A you get

[1] This is a reason Baselevel        This is a reason Baselevel       
Levels: Baselevel This is a reason
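
To illustrate how the releveled factors could then feed into the regression, here is a minimal sketch. The outcome `y` and the name `df_f` are made up for illustration and are not part of the question's data; with a real outcome the coefficients would of course differ.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Q1_A = c("This is a reason", NA, "This is a reason", NA),
                 Q1_B = c("This is another reason", "This is another reason", NA, NA))

# Hypothetical outcome, invented purely for this example
y <- c(10, 12, 9, 15)

# Replace NA with an explicit "Baselevel" and make it the reference level
df_f <- df |>
  mutate(across(starts_with("Q"),
                ~ relevel(as.factor(replace_na(.x, "Baselevel")), ref = "Baselevel")))

# lm() builds one dummy per non-reference level of each factor
fit <- lm(y ~ Q1_A + Q1_B, data = df_f)
names(coef(fit))
```

Because "Baselevel" is the reference level, each coefficient is the difference between having given that reason and not having given it. Note that the coefficient names include the level text (for example `Q1_AThis is a reason`), not just the column name.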
Julian

When working with data like this, we usually transform the reason columns into 0/1 dummies and let the column name indicate the reason. When the reasons are rather long, we keep a lookup data.frame that maps each column name to its full reason text and consult it when needed.

library(dplyr)
library(tidyr)

df %>% 
  mutate(across(Q1_A:Q1_B,
                ~ ifelse(!is.na(.x), 1, 0)))

#>   Q1_A Q1_B
#> 1    1    1
#> 2    0    1
#> 3    1    0
#> 4    0    0

# create lookup df and use when necessary
lookup_df <- df %>%
  summarise(across(everything(), ~ na.omit(unique(.x)))) %>% 
  pivot_longer(everything())

lookup_df
#> # A tibble: 2 × 2
#>   name  value                 
#>   <chr> <chr>                 
#> 1 Q1_A  This is a reason      
#> 2 Q1_B  This is another reason
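
As a rough sketch of how the dummies and the lookup table could be combined in the regression the question asks about: the outcome `y` below is hypothetical, and the names `dummies` and `coefs` are only used for this example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Q1_A = c("This is a reason", NA, "This is a reason", NA),
                 Q1_B = c("This is another reason", "This is another reason", NA, NA))

# Hypothetical outcome, invented purely for this example
y <- c(10, 12, 9, 15)

# 0/1 dummies: 1 if the reason was given, 0 otherwise
dummies <- df %>%
  mutate(across(Q1_A:Q1_B, ~ as.integer(!is.na(.x))))

# Coefficients are named after the columns: (Intercept), Q1_A, Q1_B
fit <- lm(y ~ Q1_A + Q1_B, data = dummies)

# Lookup table mapping each column name to its reason text
lookup_df <- df %>%
  summarise(across(everything(), ~ unique(.x[!is.na(.x)]))) %>%
  pivot_longer(everything())

# Attach the reason text to each coefficient; the intercept has no match
coefs <- data.frame(name = names(coef(fit)), estimate = unname(coef(fit)))
left_join(coefs, lookup_df, by = "name")
```

With numeric dummies the coefficient names stay short, which matches the output format shown in the question, and the join brings the long reason text back in only where it is needed.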

Data from OP

df <- data.frame(Q1_A = c("This is a reason", NA, "This is a reason", NA),
                 Q1_B = c("This is another reason", "This is another reason", NA, NA))

Created on 2023-03-21 with reprex v2.0.2

TimTeaFan