76

I have a data.frame consisting of numeric and factor variables as seen below.

testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
           Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4),
           stringsAsFactors=TRUE)  # needed in R >= 4.0 so Fourth and Fifth are factors

I want to build a matrix that creates dummy variables for the factors and leaves the numeric variables alone.

model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)

As expected, and as when running lm, this leaves out one level of each factor as the reference level. However, I want a matrix with a dummy/indicator variable for every level of every factor. I am building this matrix for glmnet, so I am not worried about multicollinearity.

Is there a way to have model.matrix create the dummy for every level of the factor?

sjakobi
Jared

11 Answers

72

(Trying to redeem myself...) In response to Jared's comment on @fabians' answer about automating it: all you need to supply is a named list of contrast matrices. contrasts() takes a factor and returns its contrast matrix, so we can use lapply() to run contrasts() on each factor in the data set, e.g. for the testFrame example provided:

> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
        Alice Bob Charlie David
Alice       1   0       0     0
Bob         0   1       0     0
Charlie     0   0       1     0
David       0   0       0     1

$Fifth
        Edward Frank Georgia Hank Isaac
Edward       1     0       0    0     0
Frank        0     1       0    0     0
Georgia      0     0       1    0     0
Hank         0     0       0    1     0
Isaac        0     0       0    0     1

This slots nicely into @fabians' answer:

model.matrix(~ ., data=testFrame, 
             contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))
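
To avoid hard-coding the column positions, the factor columns can be picked out programmatically. The following is a minimal sketch, assuming the question's testFrame is built with stringsAsFactors = TRUE (required on R 4.0 and later); the names is_fac and full_contrasts are illustrative and the seed is arbitrary.

```r
# Rebuild the question's data; the seed is arbitrary and only makes the run repeatable.
set.seed(42)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: needed on R >= 4.0 to get factors
)

# Logical index of the factor columns (illustrative name).
is_fac <- sapply(testFrame, is.factor)

# One full (identity) contrast matrix per factor, named after its column.
full_contrasts <- lapply(testFrame[, is_fac, drop = FALSE],
                         contrasts, contrasts = FALSE)

# Intercept, three numeric columns, and one dummy per factor level.
X <- model.matrix(~ ., data = testFrame, contrasts.arg = full_contrasts)
colnames(X)
```

Using drop = FALSE keeps the subset a data frame even when there is only one factor column, so the list passed to contrasts.arg stays correctly named.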
Gavin Simpson
  • 26
    +1. nice. you can automate it even more by replacing 4:5 with sapply(testFrame, is.factor) – fabians Dec 31 '10 at 18:05
  • Great solution for automation. Between the two of you my question has been answered perfectly, so I'm not sure whose answer should get the mark as the "Accepted Answer." I want you both to get credit. – Jared Jan 02 '11 at 02:48
  • 8
    @Jared: @fabians was the answer you were looking for, so he should get the credit - my contribution was just a little bit of sugar on top. – Gavin Simpson Jan 02 '11 at 10:27
55

You need to reset the contrasts for the factor variables:

model.matrix(~ Fourth + Fifth, data=testFrame, 
        contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=FALSE), 
                Fifth=contrasts(testFrame$Fifth, contrasts=FALSE)))

or, with a little less typing and without the proper names:

model.matrix(~ Fourth + Fifth, data=testFrame, 
    contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)), 
            Fifth=diag(nlevels(testFrame$Fifth))))
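
To illustrate what "without the proper names" means, here is a small sketch comparing the two variants. It assumes testFrame is built with stringsAsFactors = TRUE; X_named and X_diag are illustrative names. The values of the two matrices agree, but with diag() the dummy columns are no longer labelled by level (they typically fall back to level indices).

```r
# Rebuild the question's data; the seed is arbitrary.
set.seed(42)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: needed on R >= 4.0 to get factors
)

# Variant 1: contrast matrices that carry the level names.
X_named <- model.matrix(~ Fourth + Fifth, data = testFrame,
  contrasts.arg = list(Fourth = contrasts(testFrame$Fourth, contrasts = FALSE),
                       Fifth  = contrasts(testFrame$Fifth,  contrasts = FALSE)))

# Variant 2: bare identity matrices without column names.
X_diag <- model.matrix(~ Fourth + Fifth, data = testFrame,
  contrasts.arg = list(Fourth = diag(nlevels(testFrame$Fourth)),
                       Fifth  = diag(nlevels(testFrame$Fifth))))

colnames(X_named)       # columns labelled by level name
colnames(X_diag)        # columns not labelled by level name
all(X_named == X_diag)  # the entries themselves are the same
```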
fabians
  • 14
    That completely worked and I'll take that answer, but if I'm entering 20 factors, is there a way to do that universally for all variables in a frame, or am I destined to type way too much? – Jared Dec 31 '10 at 00:16
19

caret provides a nice function, dummyVars, that achieves this in two lines:

library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))

Checking the final columns:

colnames(testFrame2)

"First"  "Second"         "Third"          "Fourth.Alice"   "Fourth.Bob"     "Fourth.Charlie" "Fourth.David"   "Fifth.Edward"   "Fifth.Frank"   "Fifth.Georgia"  "Fifth.Hank"     "Fifth.Isaac"   

The nice thing here is that you get the original data frame with the factor columns replaced by their dummy variables.

More info: http://amunategui.github.io/dummyVar-Walkthrough/

Pablo Casas
12

dummyVars from caret could also be used. http://caret.r-forge.r-project.org/preprocess.html

Sagar Jauhari
  • Seems nice, but doesn't include an intercept and I can't seem to force it to. – Jared Mar 14 '13 at 17:06
  • 2
    @jared: It works for me. Example: `require(caret); (df <- data.frame(x1=c('a','b'), x2=1:2)); dummies <- dummyVars(x2~ ., data = df); predict(dummies, newdata = df)` – Andrew Dec 30 '15 at 23:00
  • 1
    @Jared no need for intercept when you have a dummy variable for every level of the factor. – Will Townes Mar 30 '16 at 00:50
  • 1
    @Jared: This adds an intercept column: `require(caret); (df <- data.frame(x1=c('a','b'), x2=1:2)); dummies <- dummyVars(x2~ ., data = df); predict(dummies, newdata = df); cbind(1, predict(dummies, newdata = df))` – MYaseen208 Nov 10 '17 at 07:58
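
The point about the intercept being redundant can be checked numerically. This is a minimal base-R sketch (it does not use caret), assuming testFrame is built with stringsAsFactors = TRUE; the variable names are illustrative. With an intercept plus a dummy for every level, the dummies of each factor sum to the intercept column, so the matrix is rank deficient.

```r
# Rebuild the question's data; the seed is arbitrary.
set.seed(42)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: needed on R >= 4.0 to get factors
)

# Full dummy coding for every factor, intercept included.
is_fac <- sapply(testFrame, is.factor)
X <- model.matrix(~ ., data = testFrame,
                  contrasts.arg = lapply(testFrame[, is_fac, drop = FALSE],
                                         contrasts, contrasts = FALSE))

# The Fourth dummies add up to the intercept column in every row.
fourth_cols <- grep("^Fourth", colnames(X))
all(rowSums(X[, fourth_cols]) == X[, "(Intercept)"])

# The numerical rank is therefore smaller than the number of columns.
qr(X)$rank
ncol(X)
```

This is harmless for a penalised fit such as glmnet, which is the use case in the question, but it matters for an unpenalised lm fit.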
3

OK, just reading the above and putting it all together. Suppose you want the matrix, e.g. 'X.factors', that is multiplied by your coefficient vector to give the linear predictor. There are still a couple of extra steps:

X.factors = 
  model.matrix( ~ ., data=X, contrasts.arg = 
    lapply(X[, sapply(X, is.factor), drop = FALSE],
           contrasts, contrasts = FALSE))

(Note that drop = FALSE keeps X[*] a data frame, with its column name intact, in case you have only one factor column.)

Then say you get something like this:

attr(X.factors,"assign")
[1]  0  1  **2**  2  **3**  3  3  **4**  4  4  5  6  7  8  9 10 #emphasis added

We want to remove the starred (**) column, i.e. the reference level, of each factor:

att = attr(X.factors,"assign")
factor.columns = unique(att[duplicated(att)])
unwanted.columns = match(factor.columns,att)
X.factors = X.factors[,-unwanted.columns]
X.factors = (data.matrix(X.factors))
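
As a sanity check on the steps above, here is a sketch that runs them on the question's testFrame (used here as X, and assumed to be built with stringsAsFactors = TRUE; X.reduced is an illustrative name). After the first column of each factor is dropped, the remaining column names match those of a default model.matrix() call.

```r
# Rebuild the question's data and use it as X; the seed is arbitrary.
set.seed(42)
X <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: needed on R >= 4.0 to get factors
)

# Full dummy coding for every factor column.
X.factors <- model.matrix(~ ., data = X, contrasts.arg =
  lapply(X[, sapply(X, is.factor), drop = FALSE], contrasts, contrasts = FALSE))

# "assign" maps each column to its term; terms that repeat are the factors.
att <- attr(X.factors, "assign")
factor.columns <- unique(att[duplicated(att)])

# The first column of each factor term is its reference level.
unwanted.columns <- match(factor.columns, att)
X.reduced <- X.factors[, -unwanted.columns]

colnames(X.reduced)
```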
user36302
3

A tidyverse answer:

library(dplyr)
library(tidyr)
result <- testFrame %>% 
    mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>% 
    mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")

yields the desired result (same as @Gavin Simpson's answer):

> head(result, 6)
  First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1     1      5     4           0         0             1           0           0          1            0         0          0
2     1     14    10           0         0             0           1           0          0            1         0          0
3     2      2     9           0         1             0           0           1          0            0         0          0
4     2      5     4           0         0             0           1           0          1            0         0          0
5     2     13     5           0         0             1           0           1          0            0         0          0
6     2     15     7           1         0             0           0           1          0            0         0          0
shosaco
2

Using the R package 'CatEncoders'

library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
           Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))

fit <- OneHotEncoder.fit(testFrame)

z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output
asdf123
2

I am currently learning the lasso and glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix() (for high-dimensional matrices, model.matrix() can be very slow, as the author of glmnet points out).

Here is a tidy way to get the same result as @fabians' and @Gavin's answers. @asdf123 has also introduced another package, CatEncoders.

> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
> 
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))

Source: R for Everyone: Advanced Analytics and Graphics (page 273)

Mankind_008
  • Thanks for the answer. The funny thing is, the `build.x` function was written by me and made possible by the answers from @fabiens and @gavin! And that's my book! So cool this came full circle. Thanks for reading! – Jared Feb 17 '19 at 06:32
2

I wrote a package called ModelMatrixModel to improve on the functionality of model.matrix(). By default, the ModelMatrixModel() function in the package returns an object containing a sparse matrix with dummy variables for all levels, which is suitable as input to cv.glmnet() in the glmnet package. Importantly, the returned object also stores the transformation parameters, such as the factor level information, which can then be applied to new data. The function can handle most items in an R formula, such as poly() and interactions. It also offers several other options, such as handling invalid factor levels and scaling the output.

#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
                        Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
                        Fourth=rep(c("Alice","Bob","Charlie","David"), 5))
newdata=data.frame(First=sample(1:10, 2, replace=TRUE),
                   Second=sample(1:20, 2, replace=TRUE), Third=sample(1:10, 2, replace=TRUE),
                   Fourth=c("Bob","Charlie"))
mm=ModelMatrixModel(~First+Second+Fourth, data = testFrame)
class(mm)
## [1] "ModelMatrixModel"
class(mm$x) #default output is sparse matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
data.frame(as.matrix(head(mm$x,2)))
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     7     17           1         0             0           0
## 2     9      7           0         1             0           0

# apply the same transformation to new data; note that the dummy variables for 'Fourth' include levels that do not appear in the new data
mm_new=predict(mm,newdata)
data.frame(as.matrix(head(mm_new$x,2))) 
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     6      3           0         1             0           0
## 2     2     12           0         0             1           0
Ben2018
1
model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)

or

model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)

should be the most straightforward approach.
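
A quick check of what dropping the intercept does when there is more than one factor, as a sketch assuming testFrame is built with stringsAsFactors = TRUE: only the first factor in the formula (Fourth) gets a column for every level, while Fifth still loses its reference level.

```r
# Rebuild the question's data; the seed is arbitrary.
set.seed(42)
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: needed on R >= 4.0 to get factors
)

# No intercept: the first factor is fully dummy coded, later factors are not.
X <- model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data = testFrame)
colnames(X)

"FourthAlice" %in% colnames(X)  # first level of Fourth is present
"FifthEdward" %in% colnames(X)  # first level of Fifth is still dropped
```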

Gregor Thomas
  • This will work well if there is only one factor, but if there are multiple factors there will still be reference levels omitted. – Gregor Thomas Jul 11 '19 at 20:22
1

You can use tidyverse to achieve this without specifying each column manually.

The trick is to make a "long" dataframe.

Then, munge a few things, and spread it back to wide to create the indicators/dummy variables.

Code:

library(tidyverse)

## add index variable for pivoting
testFrame$id <- 1:nrow(testFrame)

testFrame %>%
    ## pivot to "long" format
    gather(feature, value, -id) %>%
    ## add indicator value
    mutate(indicator=1) %>%
    ## create feature name that unites a feature and its value
    unite(feature, value, col="feature_value", sep="_") %>%
    ## convert to wide format, filling missing values with zero
    spread(feature_value, indicator, fill=0)

The output:

   id Fifth_Edward Fifth_Frank Fifth_Georgia Fifth_Hank Fifth_Isaac First_2 First_3 First_4 ...
1   1            1           0             0          0           0       0       0       0
2   2            0           1             0          0           0       0       0       0
3   3            0           0             1          0           0       0       0       0
4   4            0           0             0          1           0       0       0       0
5   5            0           0             0          0           1       0       0       0
6   6            1           0             0          0           0       0       0       0
7   7            0           1             0          0           0       0       1       0
8   8            0           0             1          0           0       1       0       0
9   9            0           0             0          1           0       0       0       0
10 10            0           0             0          0           1       0       0       0
11 11            1           0             0          0           0       0       0       0
12 12            0           1             0          0           0       0       0       0
...
Paul