Converting a categorical variable to multiple binary variables

Question

I wish to convert part of my data to binary wide format.

This is my input:

mydf <- data.frame( transaction =c (1,0,1,1,1,0,0), quality = c("NEW", "OLD","OLD", "OLD","OLD","NEW","NEW"), brand = c(1,2,3,1,2,2,1))

     transaction quality brand
1           1     NEW     1
2           0     OLD     2
3           1     OLD     3
4           1     OLD     1
5           1     OLD     2
6           0     NEW     2
7           0     NEW     1
>

and I wish to convert the brand column to wide format so that have the following output

     transaction quality brand_1 brand_2 brand_3
1           1     NEW     1        0      0
2           0     OLD     0        1      0
3           1     OLD     0        0      1
4           1     OLD     1        0      0
5           1     OLD     0        1      0
6           0     NEW     0        1      0
7           0     NEW     1        0      0

I tried different approaches such as model.matrix function but couldn't reach to my desired output.

score 2 · Accepted Answer · answered Oct 06 '17 at 06:20

For every row we select it's corresponding column which needs to be changed to 1. We generate the row/column combination by using seq(for selecting rows) and paste0 (to select columns). For all those row/column combination we use mapply to change all the corresponding values to 1 using the not-so-famous global assignment operator.

#Generate new columns to be added
cols <- paste0("brand-", 1:3)
#Initialise the columns to 0
mydf[cols] <- 0

mapply(function(x, y) mydf[x, y] <<- 1, seq(nrow(mydf)), 
                                        paste0("brand-", mydf$brand))

mydf

#  transaction quality brand brand-1 brand-2 brand-3
#1           1     NEW     1       1       0       0
#2           0     OLD     2       0       1       0
#3           1     OLD     3       0       0       1
#4           1     OLD     1       1       0       0
#5           1     OLD     2       0       1       0
#6           0     NEW     2       0       1       0
#7           0     NEW     1       1       0       0

We can remove the orginal brand column if we no longer require it using

mydf$brand <- NULL

score 1 · Answer 2 · answered Oct 06 '17 at 06:24

For a tidy approach

library(dplyr)
library(tidyr)
library(tibble)

mydf %>%
  rownames_to_column() %>%
  group_by(rowname, transaction, quality, brand) %>%
  summarise(count = n()) %>%
  spread(brand, count, sep = "-", fill = 0) %>%
  ungroup() %>%
  select(-rowname)

# # A tibble: 7 x 5
#   transaction quality `brand-1` `brand-2` `brand-3`
# *       <dbl>  <fctr>     <dbl>     <dbl>     <dbl>
# 1           1     NEW         1         0         0
# 2           0     OLD         0         1         0
# 3           1     OLD         0         0         1
# 4           1     OLD         1         0         0
# 5           1     OLD         0         1         0
# 6           0     NEW         0         1         0
# 7           0     NEW         1         0         0

Converting a categorical variable to multiple binary variables

2 Answers2