
I am trying to write a general-purpose function that works for any variable in a dataset: it should create dummy variables for that variable and then remove the original variable.

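dummy <- function(x) {
    xs <- union(x, NULL)                   # distinct values of x
    xm <- matrix(0, length(x), length(xs)) # one row per observation, one column per value
    for (i in seq_along(x)) {
        xm[i, which(xs == x[i])] <- 1      # flag the column that matches this row's value
    }
    # 1:length(xs)-1 parses as (1:n) - 1, so this keeps every column except the last
    return(xm[, seq_len(length(xs) - 1)])
}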

For example, my dataset has a categorical variable called "Married". I want to create dummy variables like this:

Married  Unmarried
1        0
1        0
0        1
0        1
0        1

The function should also remove the original variable "Married" and add the dummy variables to the dataset.

niton
  • Note that R will create dummy variables on the fly when running statistical models, so it may not be necessary to construct them ahead of time. – lmo Apr 01 '17 at 12:47
  • @niton I understand what you mean, but I am working on a huge dataset where it is tedious to create dummy variables every time, so I want a general-purpose function that I can just run to get the variables. – Vinita Rastogi Apr 01 '17 at 12:58

2 Answers


Look into the documentation for the dummy.data.frame function from the dummies package. It allows for flexible use of the model.matrix function.

library(dummies)
set.seed(20170402)
n <- 5
df <- data.frame(x = rnorm(n), 
                 y = rnorm(n, 1), 
                 red_herring = as.logical(round(runif(n, 0, 1))))

# Character column
df$red_herring <- dplyr::if_else(df$red_herring, 'Yes', 'No', NA_character_)

# Factor column
df$married <- factor(df$red_herring, levels = c('No', 'Yes'))

Dummy variables are created for character and factor classes by default:

dummies::dummy.data.frame(df)
#             x         y red_herringNo red_herringYes marriedNo marriedYes
# 1 -2.49355296 1.6209886             0              1         0          1
# 2  0.06896791 2.6101371             1              0         1          0
# 3 -0.01188042 0.4857511             0              1         0          1
# 4  0.47565318 1.1194925             0              1         0          1
# 5  0.34437239 3.0801658             1              0         1          0

You can pass a vector of variables that you want transformed to the names argument:

dummies::dummy.data.frame(df, names = 'married')
#             x         y red_herring marriedNo marriedYes
# 1 -2.49355296 1.6209886         Yes         0          1
# 2  0.06896791 2.6101371          No         1          0
# 3 -0.01188042 0.4857511         Yes         0          1
# 4  0.47565318 1.1194925         Yes         0          1
# 5  0.34437239 3.0801658          No         1          0

Or you can specify what classes of variables you want transformed into dummy variables via dummy.classes:

dummies::dummy.data.frame(df, dummy.classes = 'factor')
#             x         y red_herring marriedNo marriedYes
# 1 -2.49355296 1.6209886         Yes         0          1
# 2  0.06896791 2.6101371          No         1          0
# 3 -0.01188042 0.4857511         Yes         0          1
# 4  0.47565318 1.1194925         Yes         0          1
# 5  0.34437239 3.0801658          No         1          0
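
For comparison, here is a minimal base-R sketch of the same idea that calls model.matrix directly and swaps the original column for its dummy columns, as the question asks. The helper name replace_with_dummies and the toy data frame are made up for illustration and are not part of the dummies package. The sketch assumes the column has at least two distinct values and no missing values.

```r
# Hypothetical helper (not from the dummies package): replace one
# categorical column of a data frame with one 0/1 column per level.
replace_with_dummies <- function(data, column) {
  f <- factor(data[[column]])
  # "~ f - 1" drops the intercept, so every level gets its own column
  dm <- model.matrix(~ f - 1)
  colnames(dm) <- paste0(column, levels(f))
  # keep all other columns and append the dummy columns
  rest <- data[, setdiff(names(data), column), drop = FALSE]
  cbind(rest, as.data.frame(dm))
}

# Toy data, invented for this example
toy <- data.frame(
  id = 1:5,
  status = c("Married", "Married", "Unmarried", "Unmarried", "Unmarried")
)
replace_with_dummies(toy, "status")
```

The result keeps id and replaces status with the columns statusMarried and statusUnmarried. Dropping the intercept keeps every level, whereas a regression formula with an intercept would leave one level out as the reference.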
Garrett Mooney

I thought I'd share my answer because I was looking for something similar for a long time and finally found a solution that worked well for me. I had a categorical column in a very large dataset that I needed to convert to dummy variables, and it was not possible to do with model.matrix.

Using an indexed sparse matrix, I was able to solve my problem. It is very fast and does not use up your memory if your data is large (mine had 5.8M rows, and the categorical column had nearly 500 levels).

Please refer to this post for more information: create a sparse matrix; given the indices of non-zero elements for creation of dummy variables of a categorical column of a large dataset
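
As a rough sketch of what that can look like (not necessarily the exact code from the linked post), the sparseMatrix function from the Matrix package builds the dummy matrix from row and column indices alone. The toy vector below is invented for illustration, and the sketch assumes the column has no missing values.

```r
library(Matrix)  # recommended package, shipped with standard R installations

# Toy stand-in for a large categorical column
status <- factor(c("Married", "Married", "Unmarried", "Unmarried", "Unmarried"))

# One non-zero entry per row: row i, column = integer code of that row's level
dummies_sparse <- sparseMatrix(
  i = seq_along(status),
  j = as.integer(status),
  x = 1,
  dims = c(length(status), nlevels(status)),
  dimnames = list(NULL, levels(status))
)
dummies_sparse
```

Only the non-zero entries are stored, so memory grows with the number of rows rather than with rows times levels.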

The post covers converting a single categorical column to dummy variables, but you can extend the approach to more than one column by reshaping your data. For example, one way is to combine all categorical variables into one categorical column and extend the level numbering:

Cat Var2: 100 levels
Cat Var3: 50 levels

You then create a combined categorical variable, Var4, from Var2 and Var3:

Cat Var4: 150 levels (where the first 100 levels correspond to Var2 and the remaining 50 levels correspond to Var3)
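
One possible reading of this, sketched with two small made-up factors standing in for Var2 and Var3: the column indices of the second variable are shifted by the number of levels of the first, so both sets of dummies land in a single sparse matrix. The variable names and values are assumptions for illustration only, and missing values are assumed absent.

```r
library(Matrix)

# Hypothetical toy columns standing in for Var2 and Var3
var2 <- factor(c("a", "b", "a", "c"))
var3 <- factor(c("x", "y", "y", "x"))

n <- length(var2)
offset <- nlevels(var2)  # Var3's columns start after Var2's columns

combined <- sparseMatrix(
  i = c(seq_len(n), seq_len(n)),                       # each row appears once per variable
  j = c(as.integer(var2), offset + as.integer(var3)),  # shifted column index for Var3
  x = 1,
  dims = c(n, nlevels(var2) + nlevels(var3)),
  dimnames = list(NULL, c(paste0("var2", levels(var2)),
                          paste0("var3", levels(var3))))
)
combined
```

Each row then has exactly two non-zero entries, one in the Var2 block and one in the Var3 block.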

Using an indexed sparse matrix is fast and memory efficient, and there is no need for ugly for-loops. Hope this helps.

Ankhnesmerira