I am looking to create a function that converts every factor variable with more than 4 levels into dummy variables. The dataset has ~2311 columns, so doing this column by column is not practical and I really need a function. Your help would be immensely appreciated.

I have compiled the code below and was hoping to get it to work.

library(dummies)

# example function

for(i in names(Final_Dataset)){
    if(count (Final_Dataset[i])>4){
        y <- Final_Dataset[i]
        Final_Dataset <- cbind(Final_Dataset, dummy(y, sep = "_"))    
    }
}

I was also considering an alternative approach: first collect the indices of all columns that need to be dummied, then loop through the columns and create dummy variables for each column whose index is in that array.

lmo
Lowpar
2 Answers

2

Example data

fct = data.frame(
  a = as.factor(letters[1:10]),
  b = 1:10,
  c = as.factor(sample(letters[1:4], 10, replace = TRUE)),
  d = as.factor(letters[10:19])
)

str(fct)

'data.frame':   10 obs. of  4 variables:
 $ a: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
 $ b: int  1 2 3 4 5 6 7 8 9 10
 $ c: Factor w/ 4 levels "a","b","c","d": 2 4 1 3 1 1 2 3 1 2
 $ d: Factor w/ 10 levels "j","k","l","m",..: 1 2 3 4 5 6 7 8 9 10

# flag factor columns with more than 4 levels
fact_cols = sapply(fct, function(x) is.factor(x) && length(levels(x)) > 4)

# create dummy variables for subset (omit intercept)
# drop = FALSE keeps a data frame even if only one column is selected
dummy_cols = model.matrix(~. -1, fct[, fact_cols, drop = FALSE])

# cbind the remaining columns to the new dummy columns
out_df = cbind(fct[, !fact_cols, drop = FALSE], dummy_cols)
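
Since the question asks for a function, the steps above could be wrapped up roughly as follows. This is only a minimal base R sketch: the name `dummify_factors` and its `min_levels` argument are made up for illustration and do not come from any package.

```r
# Hypothetical helper; the name and the `min_levels` argument are illustrative.
dummify_factors <- function(df, min_levels = 4) {
  # Flag factor columns with more than `min_levels` levels
  fact_cols <- vapply(df, function(x) is.factor(x) && nlevels(x) > min_levels,
                      logical(1))
  if (!any(fact_cols)) return(df)  # nothing to convert

  # Dummy-code the flagged columns (no intercept); drop = FALSE keeps a data frame
  dummy_cols <- model.matrix(~ . - 1, df[, fact_cols, drop = FALSE])

  # Bind the untouched columns to the new dummy columns
  cbind(df[, !fact_cols, drop = FALSE], dummy_cols)
}

# Same example data as above
fct <- data.frame(
  a = as.factor(letters[1:10]),
  b = 1:10,
  c = as.factor(sample(letters[1:4], 10, replace = TRUE)),
  d = as.factor(letters[10:19])
)
out_df <- dummify_factors(fct)
dim(out_df)
```

Note that with the intercept removed, only the first flagged factor keeps an indicator column for every level; each later factor drops its first level as the reference.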
Gregor Thomas
lampros
  • Your `fact_cols` definition can be simplified a bit: `fact_cols = sapply(fct, function(x) is.factor(x) && length(levels(x)) > 4)`. This returns a boolean vector so you'd also need to change the `-fact_cols` to `!fact_cols` in the final line. – Gregor Thomas Jul 19 '17 at 19:08
  • Hi guys, super answer, I tried it and see the output - – Lowpar Jul 19 '17 at 19:16
  • `fact_cols = unlist(lapply(1:ncol(Final_Dataset), function(x) is.factor(Final_Dataset[, x]) && length(levels(Final_Dataset[, x])) > 4))`, then `dummy_cols = model.matrix(~. -1, Final_Dataset[, fact_cols])` printed "There were 12 warnings (use warnings() to see them)" – Lowpar Jul 19 '17 at 19:16
  • `out_df = cbind(Final_Dataset[, -fact_cols], dummy_cols)` then failed with "Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 15257, 0" – Lowpar Jul 19 '17 at 19:16
  • @Lowpar It works fine on lampros's nicely shared data. If it doesn't work on your data you should share an illustrative subset so that we can try to debug. Please edit it into the question. Use `dput` so it is copy/pasteable. See tips [here](https://stackoverflow.com/q/5963269/903061) – Gregor Thomas Jul 19 '17 at 19:25
  • @Gregor it definitely works, just not on my data. Thanks for the help. – Lowpar Jul 19 '17 at 19:37
  • Right - *which is why you should share your data*. Then we can help you get it to work on your data. If you don't share your data, then we can't help any more. Share your data. – Gregor Thomas Jul 19 '17 at 20:06
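
A zero-row `dummy_cols`, as in the error quoted in the comments, is what `model.matrix` returns when every row has an `NA` in at least one of the selected columns, because rows with missing values are dropped by default. Whether that is what happened with the asker's data cannot be confirmed without seeing it. The sketch below only reproduces the effect on made-up data and shows one possible way to keep the rows.

```r
# Made-up data: every row has an NA in one of the two factor columns
df <- data.frame(
  x = factor(c("a", NA, "c", NA, "e", NA)),
  y = factor(c(NA, "k", NA, "m", NA, "o"))
)

# Default behaviour: rows containing NA are dropped before dummy coding
dropped <- model.matrix(~ . - 1, df)
nrow(dropped)

# One way to keep the rows: build the model frame with na.pass first.
# Dummy columns for a missing value are then NA rather than 0/1.
mf <- model.frame(~ ., data = df, na.action = na.pass)
kept <- model.matrix(~ . - 1, mf)
nrow(kept)
```

If the row counts of the two results differ on real data, missing values are a likely reason why the final `cbind` complains about a differing number of rows.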
0

You could get the indices of all the columns with more than a given number of levels (e.g. n = 4) with something like:

which(sapply(Final_Dataset, function (c) length(levels(c)) > n))
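
As a hedged sketch of how these indices could drive the loop described in the question, the base R code below builds the dummies one column at a time. The toy `Final_Dataset` and the `column_level` naming scheme are assumptions made for illustration only, and the code assumes at least one column qualifies.

```r
# Toy stand-in for Final_Dataset (made up for illustration)
Final_Dataset <- data.frame(
  a = factor(letters[1:6]),
  b = 1:6,
  c = factor(c("x", "y", "x", "y", "x", "y"))
)

n <- 4  # must be defined before the which(...) call
idx <- which(sapply(Final_Dataset, function(col) length(levels(col)) > n))

dummies <- lapply(names(idx), function(nm) {
  # One indicator column per level; "- 1" removes the intercept,
  # so no level is dropped as a reference
  m <- model.matrix(~ f - 1, data.frame(f = Final_Dataset[[nm]]))
  colnames(m) <- paste(nm, levels(Final_Dataset[[nm]]), sep = "_")
  m
})

# Drop the original factor columns and append the dummy columns
out_df <- cbind(Final_Dataset[, -idx, drop = FALSE], do.call(cbind, dummies))
names(out_df)
```

Coding each column separately gives every factor a full set of indicator columns, unlike a single `model.matrix` call over several factors.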
merv