2

How can I create a categorical variable from mutually exclusive dummy variables (taking values 0/1)?

Basically I am looking for the exact opposite of this solution: (https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781787124479/1/01lvl1sec22/creating-dummies-for-categorical-variables).

Would appreciate a base R solution.

For example, I have the following data:

dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 
                        0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 
                        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L), 
            .Dim = c(10L, 4L), 
            .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))
          State.NJ State.NY State.TX State.VA
     [1,]        1        0        0        0
     [2,]        0        1        0        0
     [3,]        1        0        0        0
     [4,]        0        0        0        1
     [5,]        0        1        0        0
     [6,]        0        0        1        0
     [7,]        1        0        0        0
     [8,]        0        0        0        1
     [9,]        0        0        1        0
    [10,]        0        0        0        1

I would like to get the following result:

   state
1     NJ
2     NY
3     NJ
4     VA
5     NY
6     TX
7     NJ
8     VA
9     TX
10    VA

cat.var <- structure(list(state = structure(c(1L, 2L, 1L, 4L, 2L, 3L, 1L, 
4L, 3L, 4L), .Label = c("NJ", "NY", "TX", "VA"), class = "factor")), 
                    class = "data.frame", row.names = c(NA, -10L))
M--
ECII
  • 2
    Duplicate of [Reconstruct a categorical variable from dummies in R](https://stackoverflow.com/questions/49130366/reconstruct-a-categorical-variable-from-dummies-in-r) – M-- Jan 27 '20 at 21:39

4 Answers

5
# toy data
df <- data.frame(a = c(1,0,0,0,0), b = c(0,1,0,1,0), c = c(0,0,1,0,1))

df$cat <- apply(df, 1, function(i) names(df)[which(i == 1)])

Result:

> df
  a b c cat
1 1 0 0   a
2 0 1 0   b
3 0 0 1   c
4 0 1 0   b
5 0 0 1   c

To generalize, replace df and names(df) with your own data and the names of the dummy columns. One option is to wrap this in a function, e.g.,

catmaker <- function(data, varnames, catname) {
  # For each row, pick the name of the dummy column whose value is 1
  data[, catname] <- apply(data[, varnames], 1, function(i) varnames[which(i == 1)])
  return(data)
}

newdf <- catmaker(data = df, varnames = c("a", "b", "c"), catname = "newcat")

One nice aspect of the functional approach is that it is robust to variations in the order of names in the vector of column names you feed into it. I.e., varnames = c("c", "a", "b") produces the same result as varnames = c("a", "b", "c").

P.S. You added some example data after I posted this. The function works on your example, as long as you convert dummy.df to a data frame first, e.g., catmaker(data = as.data.frame(dummy.df), varnames = colnames(dummy.df), "State") does the job.
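To get from the call in the P.S. to the factor the question asks for, one possibility is to strip the "State." prefix from the new column and wrap the result in factor(). This is a minimal sketch that assumes every row contains exactly one 1 (otherwise apply() returns a list rather than a character vector); the sub() pattern and the cat.var name are borrowed from the question, not part of the original answer.

```r
# Example matrix copied from the question
dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
                        0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
                        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
            .Dim = c(10L, 4L),
            .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))

# catmaker() as defined in this answer
catmaker <- function(data, varnames, catname) {
  data[, catname] <- apply(data[, varnames], 1, function(i) varnames[which(i == 1)])
  return(data)
}

out <- catmaker(data = as.data.frame(dummy.df),
                varnames = colnames(dummy.df),
                catname = "State")

# Assumption: the category label is whatever follows the "State." prefix
cat.var <- data.frame(state = factor(sub("^State\\.", "", out$State)))
```

Here cat.var has a single factor column named state, with one level per dummy column that actually occurs in the data.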

ulfelder
2

You can use tidyr::pivot_longer:

library(dplyr)
library(tidyr)

as_tibble(dummy.df) %>%
  mutate(id = 1:n()) %>%
  pivot_longer(-id, values_to = "Value",
               names_to = c("txt", "State"), names_sep = "\\.") %>%
  filter(Value == 1) %>%
  select(State)
#> # A tibble: 10 x 1
#>    State
#>    <chr>
#>  1 NJ   
#>  2 NY   
#>  3 NJ   
#>  4 VA   
#>  5 NY   
#>  6 TX   
#>  7 NJ   
#>  8 VA   
#>  9 TX   
#> 10 VA
M--
2

You can do:

states <- names(dummy.df)[max.col(dummy.df)]

Or, if it's a matrix as in your example, names() returns NULL, so use colnames() instead:

states <- colnames(dummy.df)[max.col(dummy.df)]

Then just clean it up with sub():

sub(".*\\.", "", states)

"NJ" "NY" "NJ" "VA" "NY" "TX" "NJ" "VA" "TX" "VA"
Ritchie Sacramento
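A note on the max.col() approach above: by default max.col() breaks ties at random, so a row with no 1 at all (or with more than one) would silently get an arbitrary column. The sketch below is one way to guard against that, assuming non-one-hot rows should become NA. The helper name dummies_to_factor and the small matrix m are hypothetical and only for illustration.

```r
# Hypothetical helper: convert a 0/1 dummy matrix or data frame to a factor
dummies_to_factor <- function(m) {
  m <- as.matrix(m)
  # ties.method = "first" makes the result deterministic
  idx <- max.col(m, ties.method = "first")
  # Rows that do not contain exactly one 1 are flagged as NA
  idx[rowSums(m) != 1] <- NA
  # Assumption: the label is whatever follows the last "." in the column name
  labels <- sub(".*\\.", "", colnames(m))
  factor(labels[idx], levels = labels)
}

# Illustrative data: the last row is all zeros, so it has no category
m <- matrix(c(1, 0, 0,
              0, 1, 0,
              0, 0, 1,
              0, 1, 0,
              0, 0, 0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX")))

res <- dummies_to_factor(m)
```

Passing levels = labels keeps every dummy column as a factor level, even one that never occurs in the data.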
1

EDIT: with your data

One way, using matrix multiplication (model.matrix is only used further down, to create the dummies in the general example):

dummy.df<-structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 
                      0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 
                      0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L), .Dim = c(10L, 4L
                      ), .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", 
                                                  "State.VA")))
level_names <- colnames(dummy.df)

# use matrix multiplication to extract wanted level
res <- dummy.df%*%1:ncol(dummy.df)

# clean up
res <- as.numeric(res)
factor(res, labels = level_names)
#>  [1] State.NJ State.NY State.NJ State.VA State.NY State.TX State.NJ
#>  [8] State.VA State.TX State.VA
#> Levels: State.NJ State.NY State.TX State.VA

General reprex:

# create factor and dummy target y
dfr <- data.frame(vec = gl(n = 3, k = 3, labels = letters[1:3]),
                  y = 1:9)
dfr
#>   vec y
#> 1   a 1
#> 2   a 2
#> 3   a 3
#> 4   b 4
#> 5   b 5
#> 6   b 6
#> 7   c 7
#> 8   c 8
#> 9   c 9
# dummies creation
dfr_dummy <- model.matrix(y ~ 0 + vec, data = dfr)

# use matrix multiplication to extract wanted level
res <- dfr_dummy%*%c(1,2,3)

# clean up
res <- as.numeric(res)
factor(res, labels = letters[1:3])
#> [1] a a a b b b c c c
#> Levels: a b c
cbo