
My dataframe looks like this:

ID | value 1 | value 2 | value 3 | value 4
1  |    M    |    D    |    F    |    A
2  |    F    |    M    |    G    |    B
3  |    M    |    D    |    F    |    A
4  |    L    |    D    |    E    |    B

I want to get something like this:

value 1 | value 2 | value 3 | value 4 | Number of combinations
   M    |    D    |    F    |    A    |           2
   F    |    M    |    G    |    B    |           1
   L    |    D    |    E    |    B    |           1

i.e., I want to count the number of unique combinations of the columns value 1 to value 4.

Frank
Elisa
  • 215
  • 1
  • 3
  • 11

4 Answers


The `count` function in the plyr package will do that task.

> df
  ID value.1 value.2 value.3 value.4
1  1       M       D       F       A
2  2       F       M       G       B
3  3       M       D       F       A
4  4       L       D       E       B
> library(plyr)
> count(df[, -1])
  value.1 value.2 value.3 value.4 freq
1       F       M       G       B    1
2       L       D       E       B    1
3       M       D       F       A    2
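
If you want the column header from the question rather than plyr's default `freq`, you can rename it afterwards. A minimal base-R sketch (the name `res` is purely illustrative):

res <- count(df[, -1])
names(res)[names(res) == "freq"] <- "Number.of.combinations"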
kohske
N <- 10000

# simulate N rows; repeating a letter in a pool raises its sampling weight
d <- data.frame(
  ID = seq(1, N),
  v1 = sample(c("M", "F", "M", "L"), N, replace = TRUE),
  v2 = sample(c("D", "M", "D", "D"), N, replace = TRUE),
  v3 = sample(c("F", "G", "F", "E"), N, replace = TRUE),
  v4 = sample(c("A", "B", "A", "B"), N, replace = TRUE)
)

With data.table (fastest)

dt <- data.table::as.data.table(d)
dt[, .N, by = c('v1','v2','v3','v4')]
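
The count column is named `N` by default; if you prefer a different header, you can name it in the same call. A small sketch (the name `count` is arbitrary):

dt[, .(count = .N), by = .(v1, v2, v3, v4)]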

With dplyr

dplyr::count_(d, vars = c('v1','v2','v3','v4'))
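
Note that the underscored verbs like `count_` were deprecated in dplyr 0.7; the plain `count` takes bare column names and puts the counts in a column named `n`:

dplyr::count(d, v1, v2, v3, v4)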

With plyr

plyr::count(d, vars = c('v1','v2','v3','v4'))
plyr::ddply(d, .variables = c('v1','v2','v3','v4'), nrow)

With aggregate (slowest)

aggregate(ID ~ ., d, FUN = length)

Benchmark

microbenchmark::microbenchmark(dt[, .N, by = c('v1','v2','v3','v4')],
                               plyr::count(d, vars = c('v1','v2','v3','v4')),
                               plyr::ddply(d, .variables = c('v1','v2','v3','v4'), nrow),
                               dplyr::count_(d, vars = c('v1','v2','v3','v4')),
                               aggregate(ID ~ ., d, FUN = length), 
                               times = 1000)

Unit: microseconds
                                                         expr      min       lq      mean   median        uq        max neval  cld
                     dt[, .N, by = c("v1", "v2", "v3", "v4")]  887.807 1107.543  1263.777 1174.258  1289.724   4263.156  1000 a   
             plyr::count(d, vars = c("v1", "v2", "v3", "v4")) 3912.791 4270.387  5379.080 4498.053  5791.743 157146.103  1000   c 
 plyr::ddply(d, .variables = c("v1", "v2", "v3", "v4"), nrow) 7737.874 8553.370 10630.849 9018.266 11126.517 187301.696  1000    d
           dplyr::count_(d, vars = c("v1", "v2", "v3", "v4")) 2126.913 2432.957  2763.499 2568.251  2789.386  12549.669  1000  b  
                           aggregate(ID ~ ., d, FUN = length) 7395.440 8121.828 10546.659 8776.371 10858.263 210139.759  1000    d

It seems best to simply use data.table instead of a data.frame, as it is the fastest and doesn't need another function or library to do the counting. Note also that the aggregate function performs much more slowly on large data sets.

Final note: feel free to update with new methods.
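
One such base-R option, which Frank also mentions in the comments below, is `table`. A sketch, not included in the benchmark above; the full cross-tabulation also produces zero-count combinations, so those are filtered out:

subset(as.data.frame(table(d[, -1])), Freq > 0)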

Davor Josipovic
  • I think you should show the results for a somewhat larger data set. Also, I guess `plyr::count(d, c('v1','v2','v3','v4'))` might be correct, after looking at `args(plyr::count)`. There is probably also a `dplyr::count` that could be considered. – Frank Jun 05 '17 at 15:12
  • @Frank, it is now based on 10k rows. – Davor Josipovic Jun 05 '17 at 15:17
  • Ok, thanks. One other point: `length(ID)` doesn't seem like it would give the correct result now, since `ID` repeats values. It's generally a good idea to test for equality of results across the methods. Oh never mind, I guess it doesn't matter what values it has. Fwiw, the reason data.table is fast is documented in `?GForce`. – Frank Jun 05 '17 at 15:18
  • They give a different format, but `tapply(d$ID, d[, -1], length)` and `table(d[, -1])` are also fast for this particular case. – Frank Jun 05 '17 at 15:40

Without plyr.

aggregate(ID ~ ., d, FUN = length)  # "." means all variables in d except ID
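
Note that the count column in the result keeps the name `ID`. If you want a more descriptive header, a short sketch (the names `res` and `freq` are just for illustration):

res <- aggregate(ID ~ ., d, FUN = length)
names(res)[names(res) == "ID"] <- "freq"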
Wojciech Sobala
  • Nice, but very (!!) slow compared to plyr::count. From a `microbenchmark` test with a 4000x500 data frame, summarizing frequencies for 3 columns, it seems that `count` is 20x (!) faster. – Davor Josipovic Jun 05 '17 at 11:38

Here is a solution using the plyr package:

library(plyr)
d <- data.frame(
    ID=seq(1,4), v1=c("M","F", "M", "L"), 
    v2=c("D","M","D","D"), v3=c("F","G","F","E"), v4=c("A","B","A","B")
)
ddply(d, .(v1, v2, v3, v4), nrow)
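
For this four-row example, the call should return something like the following (plyr sorts by the grouping columns and, since `nrow` returns an unnamed value, calls the count column `V1`):

  v1 v2 v3 v4 V1
1  F  M  G  B  1
2  L  D  E  B  1
3  M  D  F  A  2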

I hope this was not homework...

Karsten W.