6

I have a data.table in r

    col1 col2 col3   col4
 1:  5.1  3.5  1.4 setosa
 2:  5.1  3.5  1.4 setosa
 3:  4.7  3.2  1.3 setosa
 4:  4.6  3.1  1.5 setosa
 5:  5.0  3.6  1.4 setosa
 6:  5.1  3.5  3.4    eer
 7:  5.1  3.5  3.4    eer
 8:  5.1  3.2  1.3    eer
 9:  5.1  3.5  1.5    eer
10:  5.1  3.5  1.4    eer


DT <- structure(list(col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 5.1, 
5.1, 5.1), col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.5, 
3.5), col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 1.4
), col4 = structure(c(1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L), .Label = c("setosa", 
"versicolor", "virginica", "eer"), class = "factor")), .Names = c("col1", 
"col2", "col3", "col4"), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"))

I want to count unique (distinct) combinations of col1 and col2 for each value of col4.

Expected output is

   col1 col2 col3   col4 count
 1:  5.1  3.5  1.4 setosa     4
 2:  5.1  3.5  1.4 setosa     4
 3:  4.7  3.2  1.3 setosa     4
 4:  4.6  3.1  1.5 setosa     4
 5:  5.0  3.6  1.4 setosa     4
 6:  5.1  3.5  3.4    eer     2
 7:  5.1  3.5  3.4    eer     2
 8:  5.1  3.2  1.3    eer     2
 9:  5.1  3.5  1.5    eer     2
10:  5.1  3.5  1.4    eer     2

How can I do this in 1 data.table syntax only?

Henrik
  • 65,555
  • 14
  • 143
  • 159
user3664020
  • 2,980
  • 6
  • 24
  • 45
  • Here are some other, non-data.table ways: http://stackoverflow.com/questions/17421776/how-to-add-count-of-unique-values-by-group-to-r-data-frame – Frank Mar 26 '16 at 14:34
  • Do we want to keep this question data.table only, or expand it to allow dplyr et al? I ask because it would make a perfect duplicate target for e.g. [this](https://stackoverflow.com/questions/51867884/r-count-distinct-elements-based-on-two-columns-by-group) – smci Aug 16 '18 at 01:50

1 Answers1

17

I had to go through a few attempts first, and ended up with this. Any good?

DT[, count:=nrow(unique(.SD)), by=col4, .SDcols=c("col1","col2")]
DT
    col1 col2 col3   col4 count
 1:  5.1  3.5  1.4 setosa     4
 2:  5.1  3.5  1.4 setosa     4
 3:  4.7  3.2  1.3 setosa     4
 4:  4.6  3.1  1.5 setosa     4
 5:  5.0  3.6  1.4 setosa     4
 6:  5.1  3.5  3.4    eer     2
 7:  5.1  3.5  3.4    eer     2
 8:  5.1  3.2  1.3    eer     2
 9:  5.1  3.5  1.5    eer     2
10:  5.1  3.5  1.4    eer     2
> 

and the same but faster thanks to Procrastinatus comment below :

DT[, count:=uniqueN(.SD), by=col4, .SDcols=c("col1","col2")]
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • 1
    I'm quite surprised you're not using `uniqueN`. Any specific reason for that? – Jaap Mar 26 '16 at 09:01
  • Yes ... Arun's been forging ahead and I'd forgotten about uniqueN! Nice. Looking at it, should we set its `by=NULL` in case `DT` is keyed by `col1`? – Matt Dowle Mar 26 '16 at 09:10
  • That will only give the count for unique values in `col1` when I try that (which isn't what OP wanted). Btw: with regard to speed, it seems that using `length(unique())` is [still faster than `uniqueN`](http://stackoverflow.com/questions/34007199/frequency-of-rows-by-id) on large datasets. – Jaap Mar 26 '16 at 09:20
  • @ProcrastinatusMaximus What will give the count for unique values in `col1`? - sorry not following. If `DT` is keyed by `col1`, I was thinking `.SD` might retain the key and hence `uniqueN`'s `by=` comes into it. So `uniqueN(.SD,by=NULL)` to be safe. – Matt Dowle Mar 26 '16 at 09:28
  • 2
    @MattDowle, yes that'd be necessary, until `by=seq_along(x)` is made the default. – Arun Mar 26 '16 at 09:29
  • 2
    On the speed difference link, isn't _length_ unique returning the number of _columns_ and so is a different result? – Matt Dowle Mar 26 '16 at 09:30
  • @MattDowle, right. In this case, `unique()` from base R isn't helpful. What he says is valid for vectors. For many groups each with only a few values, `length(unique(vector))` seems tad faster. – Arun Mar 26 '16 at 09:35
  • On the `length` part you are right. On the `by=NULL` part, I see that I misunderstood you (thought you wanted to use that in `by` of the `data.table`). – Jaap Mar 26 '16 at 09:38
  • Ok got it. Thanks both. – Matt Dowle Mar 26 '16 at 09:43
  • @MattDowle can you also please try to answer my other question? The solution given there by `mtoto` is not working in some cases. http://stackoverflow.com/questions/36232684/add-rows-in-a-data-table-but-not-when-certain-columns-take-same-values – user3664020 Mar 26 '16 at 10:00
  • Not sure about speed, but `DT[, count := uniqueN(paste(col1, col2, sep = "-")), by = col4]` could be another way. – jazzurro Mar 27 '16 at 01:32