Create categorical variable from mutually exclusive dummy variables

Question

How can I create a categorical variable from mutually exclusive dummy variables (taking values 0/1)?

Basically I am looking for the exact opposite of this solution: (https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781787124479/1/01lvl1sec22/creating-dummies-for-categorical-variables).

Would appreciate a base R solution.

For example, I have the following data:

dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 
                        0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 
                        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L), 
            .Dim = c(10L, 4L), 
            .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))

          State.NJ State.NY State.TX State.VA
     [1,]        1        0        0        0
     [2,]        0        1        0        0
     [3,]        1        0        0        0
     [4,]        0        0        0        1
     [5,]        0        1        0        0
     [6,]        0        0        1        0
     [7,]        1        0        0        0
     [8,]        0        0        0        1
     [9,]        0        0        1        0
    [10,]        0        0        0        1

I would like to get the following result:

   state
1     NJ
2     NY
3     NJ
4     VA
5     NY
6     TX
7     NJ
8     VA
9     TX
10    VA

cat.var <- structure(list(state = structure(c(1L, 2L, 1L, 4L, 2L, 3L, 1L, 
4L, 3L, 4L), .Label = c("NJ", "NY", "TX", "VA"), class = "factor")), 
                    class = "data.frame", row.names = c(NA, -10L))

Duplicate of [Reconstruct a categorical variable from dummies in R](https://stackoverflow.com/questions/49130366/reconstruct-a-categorical-variable-from-dummies-in-r) — M--, Jan 27 '20 at 21:39

ulfelder · Answer 1 · 2020-01-27T20:47:54.250

# toy data
df <- data.frame(a = c(1,0,0,0,0), b = c(0,1,0,1,0), c = c(0,0,1,0,1))

df$cat <- apply(df, 1, function(i) names(df)[which(i == 1)])

Result:

> df
  a b c cat
1 1 0 0   a
2 0 1 0   b
3 0 0 1   c
4 0 1 0   b
5 0 0 1   c

To generalize, replace df and names(df) with your own data and the names of your dummy columns. One option is to wrap this in a function, e.g.,

catmaker <- function(data, varnames, catname) {
  # for each row, pick the name of the column whose dummy equals 1
  data[, catname] <- apply(data[, varnames], 1, function(i) varnames[which(i == 1)])
  return(data)
}

newdf <- catmaker(data = df, varnames = c("a", "b", "c"), catname = "newcat")

One nice aspect of the functional approach is that it is robust to variations in the order of names in the vector of column names you feed into it. I.e., varnames = c("c", "a", "b") produces the same result as varnames = c("a", "b", "c").

P.S. You added some example data after I posted this. The function works on your example as long as you convert dummy.df to a data frame first, e.g., catmaker(data = as.data.frame(dummy.df), varnames = colnames(dummy.df), catname = "State"). Note that the resulting values keep the State. prefix (State.NJ, State.NY, and so on).
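The apply-based approach relies on every row having exactly one dummy equal to 1; with a row that has none or several, apply() returns a list instead of a character vector. A minimal sketch of a stricter variant follows. The name catmaker_checked and its error message are hypothetical and not part of the answer above.

```r
# Hypothetical helper (not from the original answer): like catmaker(), but
# stops early if the dummies are not mutually exclusive and exhaustive.
catmaker_checked <- function(data, varnames, catname) {
  # drop = FALSE keeps a one-column selection as a data frame
  dummies <- as.matrix(data[, varnames, drop = FALSE])
  ones_per_row <- rowSums(dummies == 1, na.rm = TRUE)
  if (any(ones_per_row != 1)) {
    stop("rows without exactly one dummy set to 1: ",
         paste(which(ones_per_row != 1), collapse = ", "))
  }
  # exactly one TRUE per row, so which() returns a single index per row
  data[[catname]] <- varnames[apply(dummies == 1, 1, which)]
  data
}

# toy data from the answer above
df <- data.frame(a = c(1, 0, 0, 0, 0), b = c(0, 1, 0, 1, 0), c = c(0, 0, 1, 0, 1))
newdf <- catmaker_checked(df, varnames = c("a", "b", "c"), catname = "newcat")
newdf$newcat
```

Failing loudly on malformed rows is a design choice; returning NA for such rows would be an equally reasonable alternative.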

M-- · Answer 2 · 2020-01-27T20:56:41.277

You can use tidyr::pivot_longer:

library(dplyr)
library(tidyr)

as_tibble(dummy.df) %>%
  mutate(id = 1:n()) %>%
  pivot_longer(., -id, values_to = "Value",
               names_to = c("txt", "State"), names_sep = "\\.") %>%
  filter(Value == 1) %>%
  select(State)

#> # A tibble: 10 x 1
#>    State
#>    <chr>
#>  1 NJ   
#>  2 NY   
#>  3 NJ   
#>  4 VA   
#>  5 NY   
#>  6 TX   
#>  7 NJ   
#>  8 VA   
#>  9 TX   
#> 10 VA

edited Jan 27 '20 at 20:56

answered Jan 27 '20 at 20:40

this is actually the exact opposite from what I am asking – ECII Jan 27 '20 at 20:41

Yeah, I am used to seeing the input first and then the output. I will edit shortly. – M-- Jan 27 '20 at 20:42

Ritchie Sacramento · Answer 3 · 2020-01-27T20:58:07.820

You can do:

states <- names(dummy.df)[max.col(dummy.df)]

Or, if it is a matrix as in your example, use colnames() instead:

states <- colnames(dummy.df)[max.col(dummy.df)]

Then just clean it up with sub():

sub(".*\\.", "", states)

"NJ" "NY" "NJ" "VA" "NY" "TX" "NJ" "VA" "TX" "VA"
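Since the question asks for a factor column in a data frame, the steps above can be combined into one base R sketch. Taking the level order from the column order of dummy.df, and passing ties.method = "first" so the result is deterministic even for a malformed row, are assumptions added here rather than part of the answer.

```r
dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
                        0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
                        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
            .Dim = c(10L, 4L),
            .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))

# strip the "State." prefix to get the level names, in column order
lvls <- sub(".*\\.", "", colnames(dummy.df))

# max.col() returns, for each row, the column index holding the maximum (the 1)
idx <- max.col(dummy.df, ties.method = "first")

# build the one-column data frame with a factor, as requested in the question
cat.var <- data.frame(state = factor(lvls[idx], levels = lvls))
cat.var$state
```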

edited Jan 27 '20 at 20:58

answered Jan 27 '20 at 20:48

cbo · Answer 4 · 2020-01-27T20:45:40.017

EDIT: with your data

One way is matrix multiplication (model.matrix is only needed to create the dummies in the general reprex further down):

dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
                        0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
                        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
                      .Dim = c(10L, 4L),
                      .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))
level_names <- colnames(dummy.df)

# use matrix multiplication to extract the column index of the wanted level
res <- dummy.df %*% seq_len(ncol(dummy.df))

# clean up: explicit levels keep the labels aligned even if a level never occurs
res <- as.numeric(res)
factor(res, levels = seq_along(level_names), labels = level_names)
#>  [1] State.NJ State.NY State.NJ State.VA State.NY State.TX State.NJ
#>  [8] State.VA State.TX State.VA
#> Levels: State.NJ State.NY State.TX State.VA
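The multiplication works because each row contains a single 1, so the product of a row with the vector 1, 2, ..., k simply picks out the index of that column. The sketch below illustrates this on a made-up 3 x 3 matrix; the matrix m and its all-zero third row are illustrative assumptions, not part of the question's data. It also shows what happens to a row with no dummy set.

```r
# illustrative matrix: rows 1 and 2 are valid, row 3 has no dummy set
m <- matrix(c(0, 1, 0,
              1, 0, 0,
              0, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("x", "y", "z")))

# each row times 1:3 gives the position of its 1, or 0 if there is none
idx <- as.vector(m %*% seq_len(ncol(m)))

# explicit levels keep labels aligned with columns and map the 0 to NA
res <- factor(idx, levels = seq_len(ncol(m)), labels = colnames(m))
res
```

A row with more than one 1 would sum several indices and could silently map to the wrong level, so this approach is only safe when the dummies really are mutually exclusive.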

General reprex:

# create factor and dummy target y
dfr <- data.frame(vec = gl(n = 3, k = 3, labels = letters[1:3]),
                  y = 1:9)
dfr
#>   vec y
#> 1   a 1
#> 2   a 2
#> 3   a 3
#> 4   b 4
#> 5   b 5
#> 6   b 6
#> 7   c 7
#> 8   c 8
#> 9   c 9

# dummies creation
dfr_dummy <- model.matrix(y ~ 0 + vec, data = dfr)

# use matrix multiplication to extract wanted level
res <- dfr_dummy %*% c(1, 2, 3)

# clean up
res <- as.numeric(res)
factor(res, labels = letters[1:3])
#> [1] a a a b b b c c c
#> Levels: a b c
