3

A seemingly easy problem is keeping me very busy.

I have a data frame:

> df1
  Name Score
1  Ben     1
2  Ben     2
3 John     1
4 John     2
5 John     3

I would like to create a summary of the table like this:

> df2
  Name Score_1 Score_2 Score_3
1  Ben       1       1       0
2 John       1       1       1

So df2 must (i) only show unique "Names" and (ii) create columns from the unique factors in "Score" and (iii) count the number of times a person received said score.

I have tried:

df2 <- ddply(df1, c("Name"), summarise
          ,Score_1 = sum(df1$Score == 1)
          ,Score_2 = sum(df1$Score == 2)
          ,Score_3 = sum(df1$Score == 3))

which produces:

  Name Score_1 Score_2 Score_3
1  Ben       2       2       1
2 John       2       2       1

So my attempt incorrectly counts all occurences instead of counting "per group"

EDIT: As per the comments, also tried reshape (possibly just doing it wrong):

> reshape(df1, idvar = "Name", timevar = "Score", direction = "wide")
  Name
1  Ben
3 John

For a start, the "Score" column is missing but worse than that, from my research on reshape, I am not convinced that I am going to get a count of each factor, which is the whole point.

Jaap
  • 81,064
  • 34
  • 182
  • 193
gmarais
  • 1,801
  • 4
  • 16
  • 32
  • 8
    Please search on SO how to convert from long to wide format. –  Feb 01 '16 at 09:05
  • 1
    With `reshape2` (or `data.table`): `dcast(df1, Name ~ paste("Score", Score, sep="_"), fun.aggregate = length)` should give the result you need. – Jaap Feb 01 '16 at 09:27
  • Thanks @Jaap. This elegantly answers the question. FWIW, the question which is point to in the duplicate flag `does not` answer my question, so based on that it is not a duplicate and should perhaps get an official answer for future use. – gmarais Feb 01 '16 at 09:33
  • @gmarais Though, it is not an exact dupe, it will give you some ideas – akrun Feb 01 '16 at 09:48
  • what about `df1$one <- 1; xtabs(one ~ Name + Score, data=df1)` ? – jogo Feb 01 '16 at 09:48
  • ok, posted as a solution now it's reopened – Jaap Feb 01 '16 at 10:10
  • A related post [Reshape three column data frame to matrix](http://stackoverflow.com/q/9617348). – Bhargav Rao Feb 01 '16 at 10:13
  • 1
    See, also, `table(df1)`. – alexis_laz Feb 01 '16 at 10:37
  • @Jaap has provided the more efficient solution for this problem while @akrun has provided a plyr-specific solution. I prefer `plyr` usually but it has its speed issues, with no exception here! I suggest using `reshape2::dcast` here as it is lightning fast. Thanks for all the suggetions – gmarais Feb 01 '16 at 11:05
  • I might be mistaken, but i think the first part in my answer is quite `plyr`-specific ;-) – Jaap Feb 01 '16 at 11:09
  • @gmarais The `dplyr/tidyr` package is not lagging behind in speed to data.table . But, you are right that `plyr` has speed issues.. – akrun Feb 01 '16 at 11:19
  • @Jaap, :) agreed but I like your `reshape2` better. @akrun, I tested `ddply` and `reshape2` on 31,317 rows: `reshape2` took 0.11s and `ddply` took 9.47s on the same data. I can't post the code here for you to scrutinise, can I inbox you or something else? – gmarais Feb 01 '16 at 11:34
  • `data.table` is highly probably even faster than `reshape2` (and you can use the same syntax as with `reshape2`) – Jaap Feb 01 '16 at 11:41

2 Answers2

3

You only need to make some slight modification to your code. You should use .(Name) instead of c("Name"):

ddply(df1, .(Name), summarise,
      Score_1 = sum(Score == 1),
      Score_2 = sum(Score == 2),
      Score_3 = sum(Score == 3))

gives:

  Name Score_1 Score_2 Score_3
1  Ben       1       1       0
2 John       1       1       1

Other possibilities include:

1. table(df1) as @alexis_laz mentioned in the comments, this gives:

> table(df1)
       Score
Name   1 2 3
  Ben  1 1 0
  John 1 1 1

2. The dcast function of the reshape2 package (or data.table which has the same dcast function):

library(reshape2) # or library(data.table)
dcast(df1, Name ~ paste0("Score_", Score), fun.aggregate = length) 

gives:

  Name Score_1 Score_2 Score_3
1  Ben       1       1       0
2 John       1       1       1
Community
  • 1
  • 1
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Seems the more efficient approach in `data.table` is to do `df1[ , .N, by = .(Name, Score = paste0("Score_", Score))]` (based on brief investigation [here](http://stackoverflow.com/a/35156328/3576984)) – MichaelChirico Feb 02 '16 at 18:03
  • @MichaelChirico thanx, seen that ;-) nice comparison – Jaap Feb 02 '16 at 19:31
2

We can use dplyr/tidyr

 library(dplyr)
 library(tidyr)
 df1 %>% 
     group_by(Name) %>%
      mutate(n=1, Score= paste('Score', Score, sep='_')) %>% 
      spread(Score, n, fill=0) 
 #     Name Score_1 Score_2 Score_3
 #  (chr)   (dbl)   (dbl)   (dbl)
 #1   Ben       1       1       0
 #2  John       1       1       1
akrun
  • 874,273
  • 37
  • 540
  • 662