Recoding factor variables with a lot of levels into dummies?

Question

I am working on a dataset with more than 230 variables, about 60 of which are categorical variables with more than six levels each (with no natural ordering, for example: Color).

Is there a function that can recode these variables into dummies for me? Doing it by hand would take a lot of time and carries a high risk of mistakes.

I can work with R and python, so feel free to suggest the most efficient function that can do the job.

Let's say I have a dataset called df and the set of factor columns is

clm=(clm1, clm2,clm3,....,clm60)

all of them are factors with a lot of levels:

(min=2, max=not important [may be 10, 30 or 100...etc])

Your help is much appreciated!

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html ? — Dan, Mar 19 '18 at 10:27
Possible duplicate of [How do I make a dummy variable in R?](https://stackoverflow.com/questions/12843456/how-do-i-make-a-dummy-variable-in-r) — Benjamin, Mar 19 '18 at 10:32

score 3 · Answer 1 · answered Mar 19 '18 at 10:32

Here is a short example using model.matrix that should get you started:

df <- data.frame(
    clm1 = gl(2, 6, 12, c("clm1.levelA", "clm1.levelB")),
    clm2 = gl(3, 4, 12, c("clm2.levelA", "clm2.levelB", "clm2.levelC")));
#          clm1        clm2
#1  clm1.levelA clm2.levelA
#2  clm1.levelA clm2.levelA
#3  clm1.levelA clm2.levelA
#4  clm1.levelA clm2.levelA
#5  clm1.levelA clm2.levelB
#6  clm1.levelA clm2.levelB
#7  clm1.levelB clm2.levelB
#8  clm1.levelB clm2.levelB
#9  clm1.levelB clm2.levelC
#10 clm1.levelB clm2.levelC
#11 clm1.levelB clm2.levelC
#12 clm1.levelB clm2.levelC



as.data.frame.matrix(model.matrix(rep(0, nrow(df)) ~ 0 + clm1 + clm2, df));
#   clm1clm1.levelA clm1clm1.levelB clm2clm2.levelB clm2clm2.levelC
#1                1               0               0               0
#2                1               0               0               0
#3                1               0               0               0
#4                1               0               0               0
#5                1               0               1               0
#6                1               0               1               0
#7                0               1               1               0
#8                0               1               1               0
#9                0               1               0               1
#10               0               1               0               1
#11               0               1               0               1
#12               0               1               0               1

score 0 · Answer 2 · answered Mar 19 '18 at 11:03

With pandas in Python 3, you can do something like:

import pandas as pd
df = pd.DataFrame({'clm1': ['clm1a', 'clm1b', 'clm1c'], 'clm2': ['clm2a', 'clm2b', 'clm2c']})
pd.get_dummies(df)

See the documentation for more examples.
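For the situation in the question (a wide table where only some columns are categorical), a minimal sketch could look like the following. The column names and values are made up for illustration, and it assumes the categorical columns are stored with pandas' `category` dtype, which is roughly the counterpart of an R factor.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and two unordered categorical columns.
df = pd.DataFrame({
    "price": [10.5, 3.2, 7.8, 4.1],
    "qty": [1, 4, 2, 3],
    "clm1": ["red", "blue", "red", "green"],
    "clm2": ["a", "b", "b", "a"],
})

# Assumption: categorical columns are marked with the "category" dtype.
df = df.astype({c: "category" for c in ["clm1", "clm2"]})

# Pick the categorical columns by dtype instead of typing 60 names by hand.
cat_cols = df.select_dtypes(include="category").columns.tolist()

# One 0/1 column per level; the numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=cat_cols, dtype=int)

print(encoded.columns.tolist())
```

Selecting by dtype scales to any number of columns, but it only works if the dtypes are set correctly first; integer-coded categories would need to be converted explicitly.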

hpesoj626

3,529
1
17
25

Gilles San Martin · Answer 3 · 2018-03-19T14:03:57.613

In R, the problem with the model.matrix approach proposed by @Maurits Evers is that, except for the first factor, the function drops the first level of each factor. Sometimes this is what you want, but sometimes it is not (it depends on the problem, as @Maurits Evers points out).
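The same distinction between full coding (one column per level) and reference-level coding (first level dropped) can be seen in pandas. This is only an illustrative sketch with made-up data. Note that `drop_first=True` drops the first level of every column, including the first one, so it does not reproduce the mixed layout of the intercept-free model.matrix call.

```python
import pandas as pd

# Hypothetical toy data: a two-level and a three-level unordered factor.
df = pd.DataFrame({
    "clm1": pd.Categorical(["A", "A", "B", "B"]),
    "clm2": pd.Categorical(["A", "B", "C", "C"]),
})

# Full coding: k columns for a factor with k levels.
full = pd.get_dummies(df, dtype=int)

# Reference-level coding: k - 1 columns, the first level is the baseline.
reference = pd.get_dummies(df, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reference.columns.tolist())
```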

Several functions scattered across different packages can do this (e.g. the caret package; see here for several examples).

I use the following function inspired by this Stack Overflow answer by @Jaap

#' 
#' Transform factors from a data.frame into dummy variables (one hot encoding)
#' 
#' This function will transform all factors into dummy variables with one column
#' for each level of the factor (unlike the contrasts matrices that will drop the first
#' level). The factors with only two levels will have only one column (0/1 on the second 
#' level). The ordered factors and logicals are transformed into numeric.
#' The numeric and text vectors will remain untouched.
#'

make_dummies <- function(df){

    # function to create dummy variables for one factor only
    dummy <- function(fac, name = "") {

        if(is.factor(fac) && !is.ordered(fac)) {
            l <- levels(fac)
            res <- outer(fac, l, function(fac, l) 1L * (fac == l))
            colnames(res) <- paste0(name, l)
            # two-level factors keep only the column for the second level
            if(length(l) == 2) {res <- res[, -1, drop = FALSE]}
        } else if(is.ordered(fac) || is.logical(fac)) {
            res <- as.numeric(fac)
        } else {
            res <- fac
        }
        return(res)
    }

    # Apply this function to all columns
    res <- (lapply(df, dummy))
    # change the names of the cases with only one column
    for(i in seq_along(res)){
        if(is.matrix(res[[i]]) && ncol(res[[i]]) == 1){
            colnames(res[[i]]) <- paste0(names(res)[i], ".", colnames(res[[i]]))
        }
    }
    res <- as.data.frame(res)
    return(res)
}

Example:

df <- data.frame(num = round(rnorm(12),1),
                 sex = factor(c("Male", "Female")),
                 color = factor(c("black", "red", "yellow")),
                 fac2 = factor(1:4),
                 fac3 = factor("A"),
                 size =  factor(c("small", "middle", "big"),
                                levels = c("small", "middle", "big"), ordered = TRUE),
                 logi = c(TRUE, FALSE))
print(df)
#>     num    sex  color fac2 fac3   size  logi
#> 1   0.0   Male  black    1    A  small  TRUE
#> 2  -1.0 Female    red    2    A middle FALSE
#> 3   1.3   Male yellow    3    A    big  TRUE
#> 4   1.4 Female  black    4    A  small FALSE
#> 5  -0.9   Male    red    1    A middle  TRUE
#> 6   0.1 Female yellow    2    A    big FALSE
#> 7   1.4   Male  black    3    A  small  TRUE
#> 8   0.1 Female    red    4    A middle FALSE
#> 9   1.6   Male yellow    1    A    big  TRUE
#> 10  1.1 Female  black    2    A  small FALSE
#> 11  0.2   Male    red    3    A middle  TRUE
#> 12  0.3 Female yellow    4    A    big FALSE
make_dummies(df)
#>     num sex.Male color.black color.red color.yellow fac2.1 fac2.2 fac2.3
#> 1   0.0        1           1         0            0      1      0      0
#> 2  -1.0        0           0         1            0      0      1      0
#> 3   1.3        1           0         0            1      0      0      1
#> 4   1.4        0           1         0            0      0      0      0
#> 5  -0.9        1           0         1            0      1      0      0
#> 6   0.1        0           0         0            1      0      1      0
#> 7   1.4        1           1         0            0      0      0      1
#> 8   0.1        0           0         1            0      0      0      0
#> 9   1.6        1           0         0            1      1      0      0
#> 10  1.1        0           1         0            0      0      1      0
#> 11  0.2        1           0         1            0      0      0      1
#> 12  0.3        0           0         0            1      0      0      0
#>    fac2.4 fac3.A size logi
#> 1       0      1    1    1
#> 2       0      1    2    0
#> 3       0      1    3    1
#> 4       1      1    1    0
#> 5       0      1    2    1
#> 6       0      1    3    0
#> 7       0      1    1    1
#> 8       1      1    2    0
#> 9       0      1    3    1
#> 10      0      1    1    0
#> 11      0      1    2    1
#> 12      1      1    3    0

Created on 2018-03-19 by the reprex package (v0.2.0).
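For readers working in Python, the rules of the R function above could be approximated with pandas along these lines. This is a rough sketch, not a port: `make_dummies_pd` is a hypothetical name, the data is made up, and it assumes unordered and ordered factors are stored as pandas categoricals.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype


def make_dummies_pd(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pandas counterpart of the R helper (same rules, not a port)."""
    pieces = []
    for name, col in df.items():
        if isinstance(col.dtype, pd.CategoricalDtype):
            if col.cat.ordered:
                # Ordered factor -> 1-based integer codes (missing values become 0).
                pieces.append((col.cat.codes + 1).rename(name))
            else:
                d = pd.get_dummies(col, prefix=name, prefix_sep=".", dtype=int)
                if d.shape[1] == 2:
                    # Two levels: keep only the column for the second level.
                    d = d.iloc[:, [1]]
                pieces.append(d)
        elif is_bool_dtype(col):
            # Logical -> 0/1.
            pieces.append(col.astype(int))
        else:
            # Numeric and text columns are left untouched.
            pieces.append(col)
    return pd.concat(pieces, axis=1)


df = pd.DataFrame({
    "num": [0.5, -1.0, 1.3, 1.4],
    "sex": pd.Categorical(["Male", "Female", "Male", "Female"]),
    "color": pd.Categorical(["black", "red", "yellow", "black"]),
    "size": pd.Categorical(["small", "middle", "big", "small"],
                           categories=["small", "middle", "big"], ordered=True),
    "logi": [True, False, True, False],
})
out = make_dummies_pd(df)
print(out)
```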

*"Sometimes this is what you want but often it is not."* That really depends on the problem, doesn't it. I would argue that most often you'd want exactly that; hence the reason for `model.matrix`. For example, dummy encoding in most statistical models in R (e.g.`lm`, `glm`, etc.) is making implicit use of `model.matrix`. — Maurits Evers, Mar 19 '18 at 13:18
Yes, I agree that it depends on the problem! But when you fit a model, the construction of the model matrix is generally embedded in the modeling tool and you don't have to do it by hand... In this situation, removing the intercept also produces a rather strange matrix where the encoding is different for the first factor than for the subsequent ones. The encoding presented here is useful, for example, when you want to display a heatmap with hierarchical clustering (e.g. on Gower distance) or to compute Euclidean distances on the dummy matrix (although some will argue against this approach...) — Gilles San Martin, Mar 19 '18 at 13:59
Removing the intercept in a statistical model is *not* strange at all; best example is when you work with standardised data (see [here](https://stats.stackexchange.com/questions/7948/when-is-it-ok-to-remove-the-intercept-in-a-linear-regression-model) or [here](https://stats.stackexchange.com/questions/21738/why-is-a-zero-intercept-linear-regression-model-predicts-better-than-a-model-wit) for an extended discussion). [...] — Maurits Evers, Mar 19 '18 at 23:51
[...] The construction of the `model.matrix` being embedded in various modelling functions (which is exactly what I said in my first comment) shouldn't stop you from using `model.matrix`. As a matter of fact, it is useful for cases exactly like the one we're discussing here. — Maurits Evers, Mar 19 '18 at 23:52
