R - count all combinations

Question

I would like to count all combinations in a data.frame.

The data look like this

   9 10 11 12
1  1  1  1  1
2  0  0  0  0
3  0  0  0  0
4  1  1  1  1
5  1  1  1  1
6  0  0  0  0
7  1  0  0  1
8  1  0  0  1
9  1  1  1  1
10 1  1  1  1

The output I want is simply

Do you know any simple function to do that ?

Thanks

dt = structure(list(`9` = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 1), `10` = c(1, 
0, 0, 1, 1, 0, 0, 0, 1, 1), `11` = c(1, 0, 0, 1, 1, 0, 0, 0, 
1, 1), `12` = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 1)), .Names = c("9", 
"10", "11", "12"), class = "data.frame", row.names = c(NA, -10L
))

If your data is indeed binary you could avoid any database operation by mapping each row to a unique decimal number using `(as.matrix(dt) %*% 2 ^ (0:(length(dt) - 1)))[, 1L]` and continue from there with `duplcated` and `tabulate` to match your desired output. See this [QA](http://stackoverflow.com/questions/22886040/how-should-i-count-the-number-of-unique-rows-in-a-binary-matrix). — alexis_laz, Dec 18 '15 at 15:40

akrun · Accepted Answer · 2015-12-16T15:47:11.417

We can either use data.table or dplyr. These are very efficient. We convert the 'data.frame' to 'data.table' (setDT(dt)), grouped by all the columns of 'dt' (names(dt)), we get the nrow (.N) as the 'Count'

library(data.table)
setDT(dt)[,list(Count=.N) ,names(dt)]

Or we can use a similar methodology using dplyr.

library(dplyr)
names(dt) <- make.names(names(dt))
dt %>%
   group_by_(.dots=names(dt)) %>%
   summarise(count= n())

Benchmarks

In case somebody wants to look at some metrics (and also to backup my claim earlier (efficient!)),

set.seed(24)
df1 <- as.data.frame(matrix(sample(0:1, 1e6*6, replace=TRUE), ncol=6))

akrunDT <-  function() {
  as.data.table(df1)[,list(Count=.N) ,names(df1)]
 }

akrunDplyr <- function() {
  df1 %>%
    group_by_(.dots=names(df1)) %>%
    summarise(count= n())
}

cathG <- function() {
 aggregate(cbind(n = 1:nrow(df1))~., df1, length)
  }

docendoD <- function() {
  as.data.frame(table(comb = do.call(paste, df1)))
}

deena <- function() {
   table(apply(df1, 1, paste, collapse = ","))
}

Here are the microbenchmark results

library(microbenchmark)
microbenchmark(akrunDT(), akrunDplyr(), cathG(), docendoD(),  deena(),
  unit='relative', times=20L)
#   Unit: relative
#        expr       min        lq      mean   median        uq        max neval  cld
#     akrunDT()  1.000000  1.000000  1.000000  1.00000  1.000000  1.0000000    20     a   
#  akrunDplyr()  1.512354  1.523357  1.307724  1.45907  1.365928  0.7539773    20     a   
#       cathG() 43.893946 43.592062 37.008677 42.10787 38.556726 17.9834245    20    c 
#    docendoD() 18.778534 19.843255 16.560827 18.85707 17.296812  8.2688541    20    b  
#       deena() 90.391417 89.449547 74.607662 85.16295 77.316143 34.6962954    20    d

thank @akrun but the `dplyr` solution does not work. I get `9 10 11 12 count 1 9 10 11 12 10`. — giac, Dec 16 '15 at 12:48
@giacomoV I guess it is based on the unusual column names that start with numbers. — akrun, Dec 16 '15 at 12:50
@giacomoV Please check the updated code. It should work now. — akrun, Dec 16 '15 at 12:52
btw akrun, I did upvote you, I wanted to find a base R sol but I do like really much the data.table one and am convinced it's more efficient ;-) and I do think it deserves more upvotes (didn't follow all the discussion about the votes, just saw the last comments) — Cath, Dec 16 '15 at 13:57
the `.dots` argument in group_by is wicked! As usual, the code could be slightly shortened for the dplyr version, just using `count(.dots = ...)` instead of `group_by(...) %>% summarise(n())` — tjebo, Dec 02 '20 at 19:18
@Tjebo that is an old code when the `group_by_` was still around. Now, the `_` is deprecated — akrun, Dec 02 '20 at 19:59

Cath · Answer 2 · 2015-12-16T14:34:02.307

A base R solution with aggregate:

aggregate(seq(nrow(dt))~., data=dt, FUN=length)
#  9 10 11 12 seq(nrow(dt))
#1 0  0  0  0             3
#2 1  0  0  1             2
#3 1  1  1  1             5

edit

To get colnames more conformed to your output, you can do:

`colnames<-`(aggregate(seq(nrow(dt))~., data=dt, FUN=length), c("c", "o", "m", "b", "n"))
#  c o m b n
#1 0 0 0 0 3
#2 1 0 0 1 2
#3 1 1 1 1 5

Or, shorter:

aggregate(cbind(n = 1:nrow(dt))~., dt, length)
#  9 10 11 12 n
#1 0  0  0  0 3
#2 1  0  0  1 2
#3 1  1  1  1 5

score 8 · Answer 3 · answered Dec 16 '15 at 12:52

8

You could try the following approach using only base R:

as.data.frame(table(comb = do.call(paste, dt)))
#     comb Freq
#1 0 0 0 0    3
#2 1 0 0 1    2
#3 1 1 1 1    5

answered Dec 16 '15 at 12:52

talat

68,970
21
126
157

1

To exactly match the desired output (count column as "n") you could slightly adjust the code to `as.data.frame(table(comb = do.call(paste, dt)), responseName = "n")`. – talat Dec 16 '15 at 13:10

score 5 · Answer 4 · answered Dec 16 '15 at 12:53

5

Maybe that too : table(apply(dt, 1, paste, collapse = ","))

answered Dec 16 '15 at 12:53

score 3 · Answer 5 · answered Dec 16 '15 at 12:54

Also in base R:

Use unique.matrix to get the list of unique combinations.

uncs <- unique.matrix(as.matrix(df), MARGIN = 1)

Then make comparisons and count:

cnts <- colSums(apply(uncs, 1, function(r) apply(dt, 1, function(r2) all(r == r2))))
cbind(comb = apply(uncs, 1, paste), n = cnts)

Mike Dolan Fliss · Answer 6 · 2021-05-24T14:31:03.030

The dplyr solution above could have been done easier with group_by_all()...

dt %>% group_by_all %>% count

...which as I understand has been superseded by the across() method. Adding in a bit of sorting, and you get:

dt %>% group_by(across()) %>% count %>% arrange(desc(n))

> dt %>% group_by(across()) %>% count %>% arrange(desc(n))
# A tibble: 3 x 5
# Groups:   9, 10, 11, 12 [3]
    `9`  `10`  `11`  `12`     n
  <dbl> <dbl> <dbl> <dbl> <int>
1     1     1     1     1     5
2     0     0     0     0     3
3     1     0     0     1     2

Which you could cast to a matrix if you wished.

R - count all combinations

6 Answers6

Benchmarks

Linked