4

I'd like to group this data but apply different functions to some columns when grouping.

ID  type isDesc isImage
1   1    1      0
1   1    0      1
1   1    0      1
4   2    0      1
4   2    1      0
6   1    1      0
6   1    0      1
6   1    0      0

I want to group by ID, columns isDesc and isImage can be summed, but I would like to get the value of type as it is. type will be the same through the whole dataset. The result should look like this:

ID  type isDesc isImage
1   1    1      2
4   2    1      1
6   1    1      1

Currently I am using

library(plyr)
summarized = ddply(data, .(ID), numcolwise(sum))

but it simply sums up all the columns. You don't have to use ddply but if you think it's good for the job I'd like to stick to it. data.table library is also an alternative

Alan
  • 739
  • 6
  • 18
  • What will you do when there's more than one type for an ID? Just take one (chosen by whatever means) or is that you really want to group by both ID and type? This sounds like an SQL query (see "group by"). – igelkott Mar 15 '13 at 15:05

1 Answers1

6

Using data.table:

require(data.table)
dt <- data.table(data, key="ID")
dt[, list(type=type[1], isDesc=sum(isDesc), 
                  isImage=sum(isImage)), by=ID]

#    ID type isDesc isImage
# 1:  1    1      1       2
# 2:  4    2      1       1
# 3:  6    1      1       1

Using plyr:

ddply(data , .(ID), summarise, type=type[1], isDesc=sum(isDesc), isImage=sum(isImage))
#   ID type isDesc isImage
# 1  1    1      1       2
# 2  4    2      1       1
# 3  6    1      1       1

Edit: Using data.table's .SDcols, you can do this in case you've too many columns that are to be summed, and other columns to be just taken the first value.

dt1 <- dt[, lapply(.SD, sum), by=ID, .SDcols=c(3,4)]
dt2 <- dt[, lapply(.SD, head, 1), by=ID, .SDcols=c(2)]
> dt2[dt1]
#    ID type isDesc isImage
# 1:  1    1      1       2
# 2:  4    2      1       1
# 3:  6    1      1       1

You can provide column names or column numbers as arguments to .SDcols. Ex: .SDcols=c("type") is also valid.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Hi again Arun :D, is there a possibility to define, "all columns except x"? I am dealing with about 70 columns, most of them will be `sum` but only a few should be `unique`. Tnx – Alan Mar 15 '13 at 13:57
  • tnx, I think it works, I just added the setkey so the code is complete – Alan Mar 15 '13 at 14:38
  • 1
    @frinx, I've already set dt's key to `ID` at the top. When you use `by` with `key-column` and store it in a `data.table`, the result automatically has the same key. Meaning, `dt2` and `dt1`'s keys are already `ID`. Please write here before editing, to verify. – Arun Mar 15 '13 at 14:43
  • I missed the top part with key. Sry. However, without setting the key as you did and when using `by` it doesn't want to merge the two sets. It throws an error asking for the key. I'll revert the edit. – Alan Mar 15 '13 at 14:44
  • Without setting the key, you can use `by`, but the resulting `data.table`'s key won't be set. – Arun Mar 15 '13 at 14:46
  • 1
    +1 There's also `keyby` instead of `by` maybe, to remove any uncertainty/dependency on a previous `setkey`. I suppose the problem with `cbind(dt2,dt1)` is then the `ID` column would appear in the result twice. Hm. – Matt Dowle Mar 15 '13 at 16:20