1

I have a dataframe like below

id <- c(1,1,1,2,2,2,1,3,4,4)
product <- c("a","b","c","a","d","f","e","f","e","f") 
df <- data.frame(id,product)

   id product
1   1       a
2   1       b
3   1       c
4   2       a
5   2       d
6   2       f
7   1       e
8   3       f
9   4       e
10  4       f

I want to transform it to a dataframe as below.

id a b c d e f
1  1 1 1 0 1 0
2  1 0 0 1 0 1
3  0 0 0 0 0 1
4  0 0 0 0 1 1

Essentially, I need only one record for each id and the record should contain 0 or 1 depending on whether a product is purchased or not. I used model.matrix, but it does not group by id, I get 10 rows as in the original dataset.

user3897
  • 551
  • 2
  • 5
  • 14
  • Shouldn't this be on StackOverflow? – Jon Nov 09 '16 at 22:41
  • @ashkan I don't think it's a duplicate of that one; the result here is an incidence matrix, not simply a rearrangement of an existing long format to wide. – Glen_b Nov 09 '16 at 23:16

4 Answers4

4

as.data.frame.table (which is what is called when you as.data.frame a table) quite reasonably converts a table to long-form. To prevent that, you need to treat it as a matrix:

 as.data.frame.matrix(table(df))
  a b c d e f
1 1 1 1 0 1 0
2 1 0 0 1 0 1
3 0 0 0 0 0 1
4 0 0 0 0 1 1
Glen_b
  • 7,883
  • 2
  • 37
  • 48
3

The reshape command is flexible, akin to PROC TRANSPOSE and all its idiosyncracies. It will give you id as a variable in the output, the missing values are uncoded levels in the output dataset. This is easily handled and reflective of real data (e.g. absence of data indicating a negative (0) condition is not data of absence in output).

df$ind <- 1

reshape(df, direction='wide', timevar='product', idvar='id')

gives

> reshape(df, direction='wide', timevar='product', idvar='id')
  id ind.a ind.b ind.c ind.d ind.f ind.e
1  1     1     1     1    NA    NA     1
4  2     1    NA    NA     1     1    NA
8  3    NA    NA    NA    NA     1    NA
9  4    NA    NA    NA    NA     1     1

and it's easy R to do the rest.

aggregate serves a similar function:

'aggregate(df$product, df[, 'id', drop=F], table)'

gives

> aggregate(df$product, df[, 'id', drop=F], table)
  id x.a x.b x.c x.d x.e x.f
1  1   1   1   1   0   1   0
2  2   1   0   0   1   0   1
3  3   0   0   0   0   0   1
4  4   0   0   0   0   1   1

and it's easy R to do the rest.

AdamO
  • 4,283
  • 1
  • 27
  • 39
  • You could make this a bit simpler too - `aggregate(product ~ id, data=df, table)` or `aggregate(df["product"], df["id"], table)` depending on your preference. – thelatemail Nov 09 '16 at 23:28
2

Check out the help for the table function.

table(id,product)

To convert this to a data frame use

as.data.frame.matrix(table(id,product))

I found this tip in a blog post by Rronan.

Flounderer
  • 640
  • 5
  • 17
  • this doesn't create a dataframe in the required format. I can use as.data.frame(table(id,product)), which returns 24 obs with 3 variables, but is not what I expect. I want a dataframe with 4 obs and 6 variables. – user3897 Nov 09 '16 at 21:56
  • see @Glen_b's answer ... – Ben Bolker Nov 09 '16 at 23:02
1

One option depending on reshape2, there are many others that may/may not require some massaging:

> reshape2::dcast(data = df,formula = id~product,fun.aggregate = length,fill = 0L)
Using product as value column: use value.var to override.
  id a b c d e f
1  1 1 1 1 0 1 0
2  2 1 0 0 1 0 1
3  3 0 0 0 0 0 1
4  4 0 0 0 0 1 1
joran
  • 169,992
  • 32
  • 429
  • 468