
After creating some dummy variables with fastDummies, I end up with unhelpful column names: they all start with ".data_".

a <- as.factor(c("green", "yellow", "blue"))
b <- as.factor(c("blue", "yellow", "green"))

df <- data.frame(a, b)

library(fastDummies)
dummy1 <- dummy_cols(df$a, remove_selected_columns = TRUE)
dummy2 <- dummy_cols(df$b, remove_selected_columns = TRUE)

I need to put the dummies back together in one data frame, so how do I replace the ".data_" part of each column name with the name of the variable it belongs to (e.g. a_blue, a_green, a_yellow for dummy1 and b_blue, b_green, b_yellow for dummy2)?

I found rename(), but I would have to call it for every single variable by hand. Is there a more automated way?

EDIT: After using dummy_cols(), the output is a data frame with one new variable per category of the original variable. So a, with the 3 categories yellow, blue and green, becomes a data frame with 3 columns called .data_blue, .data_green and .data_yellow. Those new variables are binary. Maybe this helps to illustrate what I mean.

aynber
Elena
  • what does `dummy_cols` do? – latlio Jan 12 '21 at 12:56
  • it turns the categorical variable into a dummy. So a becomes a dataframe with 3 binary variables – Elena Jan 12 '21 at 13:53
  • @Elena do you have many different dataframes `dummy1, ..., dummy2` or a single, large dataframe with all the variables and all their categories? – Ric S Jan 12 '21 at 14:20
  • many different ones. One per variable (I have 21 variables) that need to be combined into one. I tried using dummy_cols on the whole dataframe but the results were... weird. If you know a different way, I'd love to hear it. – Elena Jan 12 '21 at 14:28

1 Answer


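The function wants the whole cake at once: give dummy_cols() the data frame (or a subset of its columns) rather than a single extracted column like df$a, and it uses the column names as prefixes.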

cols <- c("a", "b")
dummy_cols(df[cols], remove_selected_columns=TRUE)
#   a_blue a_green a_yellow b_blue b_green b_yellow
# 1      0       1        0      1       0        0
# 2      0       0        1      0       0        1
# 3      1       0        0      0       1        0
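
If the per-variable data frames already exist and only the names need fixing, the ".data_" prefix can also be swapped out in one pass with base R. This is only a rough sketch: `dummy1` and `dummy2` are rebuilt by hand here as stand-ins for the dummy_cols() output described in the question (so it runs without fastDummies), and the names `dummies`, `renamed` and `combined` are made up for illustration.

```r
# Hand-built stand-ins for the per-variable outputs shown in the question
# (assumption: each column is named ".data_<category>").
dummy1 <- data.frame(.data_blue   = c(0L, 0L, 1L),
                     .data_green  = c(1L, 0L, 0L),
                     .data_yellow = c(0L, 1L, 0L),
                     check.names = FALSE)
dummy2 <- data.frame(.data_blue   = c(1L, 0L, 0L),
                     .data_green  = c(0L, 0L, 1L),
                     .data_yellow = c(0L, 1L, 0L),
                     check.names = FALSE)

# Name each list element after the variable it came from.
dummies <- list(a = dummy1, b = dummy2)

# Replace the leading ".data_" with "<variable>_" in every data frame.
renamed <- Map(function(d, prefix) {
  names(d) <- sub("^\\.data_", paste0(prefix, "_"), names(d))
  d
}, dummies, names(dummies))

# Bind the columns side by side; unname() keeps cbind() from prepending
# the list names ("a.", "b.") to the column names.
combined <- do.call(cbind, unname(renamed))
names(combined)
```

With 21 variables, the only manual step left would be building the named list.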
jay.sf
  • Using the whole df, I get an output with the colnames `.data_blue, .data_green, .data_yellow, .data_blue, .data_green, .data_yellow` instead of `a_blue, a_green, etc.` like I wish. Because my categories for the variables are the same, this is quite confusing. – Elena Jan 12 '21 at 16:02
  • also, I just checked and using the whole df I get an error: `No character or factor columns found. Please use select_columns to choose columns.` – Elena Jan 12 '21 at 16:06
  • You should make that reproducible for others then, read: https://stackoverflow.com/a/5963610/6574038 – jay.sf Jan 12 '21 at 16:24
  • Oh wow, you are right. My example works perfectly as well. This is not frustrating at all. Thank you, though. – Elena Jan 12 '21 at 16:59