Convert factor variable to N binary variables

Question

I have a dataframe like below

id <- c(1,1,1,2,2,2,1,3,4,4)
product <- c("a","b","c","a","d","f","e","f","e","f") 
df <- data.frame(id,product)

   id product
1   1       a
2   1       b
3   1       c
4   2       a
5   2       d
6   2       f
7   1       e
8   3       f
9   4       e
10  4       f

I want to transform it to a dataframe as below.

id a b c d e f
1  1 1 1 0 1 0
2  1 0 0 1 0 1
3  0 0 0 0 0 1
4  0 0 0 0 1 1

Essentially, I need only one record for each id and the record should contain 0 or 1 depending on whether a product is purchased or not. I used model.matrix, but it does not group by id, I get 10 rows as in the original dataset.

@ashkan I don't think it's a duplicate of that one; the result here is an incidence matrix, not simply a rearrangement of an existing long format to wide. — Glen_b, Nov 09 '16 at 23:16

Glen_b · Answer 1 · 2016-11-09T23:27:51.283

4

as.data.frame.table (which is what is called when you as.data.frame a table) quite reasonably converts a table to long-form. To prevent that, you need to treat it as a matrix:

 as.data.frame.matrix(table(df))
  a b c d e f
1 1 1 1 0 1 0
2 1 0 0 1 0 1
3 0 0 0 0 0 1
4 0 0 0 0 1 1

edited Nov 09 '16 at 23:27

answered Nov 09 '16 at 22:57

Glen_b

7,883
2
37
48

score 3 · Answer 2 · answered Nov 09 '16 at 22:39

The reshape command is flexible, akin to PROC TRANSPOSE and all its idiosyncracies. It will give you id as a variable in the output, the missing values are uncoded levels in the output dataset. This is easily handled and reflective of real data (e.g. absence of data indicating a negative (0) condition is not data of absence in output).

df$ind <- 1

reshape(df, direction='wide', timevar='product', idvar='id')

gives

> reshape(df, direction='wide', timevar='product', idvar='id')
  id ind.a ind.b ind.c ind.d ind.f ind.e
1  1     1     1     1    NA    NA     1
4  2     1    NA    NA     1     1    NA
8  3    NA    NA    NA    NA     1    NA
9  4    NA    NA    NA    NA     1     1

and it's easy R to do the rest.

aggregate serves a similar function:

'aggregate(df$product, df[, 'id', drop=F], table)'

gives

> aggregate(df$product, df[, 'id', drop=F], table)
  id x.a x.b x.c x.d x.e x.f
1  1   1   1   1   0   1   0
2  2   1   0   0   1   0   1
3  3   0   0   0   0   0   1
4  4   0   0   0   0   1   1

and it's easy R to do the rest.

You could make this a bit simpler too - `aggregate(product ~ id, data=df, table)` or `aggregate(df["product"], df["id"], table)` depending on your preference. — thelatemail, Nov 09 '16 at 23:28

score 2 · Answer 3 · answered Nov 09 '16 at 21:49

2

Check out the help for the table function.

table(id,product)

To convert this to a data frame use

as.data.frame.matrix(table(id,product))

I found this tip in a blog post by Rronan.

answered Nov 09 '16 at 21:49

Flounderer

640
5
17

this doesn't create a dataframe in the required format. I can use as.data.frame(table(id,product)), which returns 24 obs with 3 variables, but is not what I expect. I want a dataframe with 4 obs and 6 variables. – user3897 Nov 09 '16 at 21:56
see @Glen_b's answer ... – Ben Bolker Nov 09 '16 at 23:02

score 1 · Answer 4 · answered Nov 09 '16 at 22:59

One option depending on reshape2, there are many others that may/may not require some massaging:

> reshape2::dcast(data = df,formula = id~product,fun.aggregate = length,fill = 0L)
Using product as value column: use value.var to override.
  id a b c d e f
1  1 1 1 1 0 1 0
2  2 1 0 0 1 0 1
3  3 0 0 0 0 0 1
4  4 0 0 0 0 1 1

Convert factor variable to N binary variables

4 Answers4