Count frequency of duplicated rows

Question

I have a matrix with a large number of duplicate rows and would like to obtain a matrix of the unique rows together with a frequency count for each unique row.

The example shown below solves this problem but is painfully slow.

# Count how many rows of tbl are identical to the given row
rowsInTbl <- function(tbl,row){
  sum(apply(tbl, 1, function(x) all(x == row) ))
}

# Return the unique rows of tblall with an extra "Frequency" column
colFrequency <- function(tblall){
  tbl <- unique(tblall)
  results <- matrix(nrow = nrow(tbl),ncol=ncol(tbl)+1)
  results[,1:ncol(tbl)] <- as.matrix(tbl)
  dimnames(results) <- list(c(rownames(tbl)),c(colnames(tbl),"Frequency"))

  # One full scan of tblall per unique row, which is what makes this slow
  freq <- apply(tbl,1,function(x)rowsInTbl(tblall,x))
  results[,"Frequency"] <- freq
  return(results)
}


m <- matrix(c(1,2,3,4,3,4,1,2,3,4),ncol=2,byrow=TRUE)
dimnames(m) <- list(letters[1:nrow(m)],c("c1","c2"))
print("Matrix")
print(m)

[1] "Matrix"
  c1 c2
a  1  2
b  3  4
c  3  4
d  1  2
e  3  4

print("Duplicate frequency table")
print(colFrequency(m))


[1] "Duplicate frequency table"
  c1 c2 Frequency
a  1  2         2
b  3  4         3

Here are the speed measurements for the answers of @Heroka and @m0h3n compared to my example. The matrix shown above was repeated 1000 times. data.table is clearly the fastest solution.

[1] "Duplicate frequency table - my example"
   user  system elapsed 
   0.372   0.000   0.371 

[1] "Duplicate frequency table - data.table"
   user  system elapsed 
   0.008   0.000   0.008 

[1] "Duplicate frequency table - aggregate"
   user  system elapsed 
   0.092   0.000   0.089
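
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch. The exact benchmark script is not shown above, so the way the matrix is stacked (`rep()` on the row indices) and the names `big`, `df`, `t_aggregate` and `t_dt` are assumptions for illustration. The data.table part only runs if the package is installed, and timings will differ between machines.

```r
# Rebuild the example matrix and stack 1000 copies of it (assumed setup)
m <- matrix(c(1,2,3,4,3,4,1,2,3,4), ncol = 2, byrow = TRUE)
colnames(m) <- c("c1", "c2")
big <- m[rep(seq_len(nrow(m)), times = 1000), ]   # 5000 rows, 2 unique rows

# Base R: aggregate() grouped by every column, as in the aggregate answer
df <- as.data.frame(big)
t_aggregate <- system.time(
  res_aggregate <- aggregate(df, by = df, length)[1:(ncol(df) + 1)]
)
print(t_aggregate)

# data.table: count rows per group with .N (skipped if not installed)
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(big)
  t_dt <- system.time(res_dt <- dt[, .N, by = names(dt)])
  print(t_dt)
}
```

Both approaches should report the same counts for the two unique rows, so the comparison is only about speed.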

In my opinion, this question differs from the one about apply functions: here the problem is that an apply-based approach is too slow for a large dataset, so a different approach is needed. – scs, Jun 20 '16 at 14:01
Yes, sorry, I pasted the wrong link. See [this one](http://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group): it is a sum by group, but the idea is the same. – zx8754, Jun 20 '16 at 20:03

score 5 · Accepted Answer · answered Jun 20 '16 at 12:12

Looks like a job for data.table, as you need something that can aggregate quickly.

library(data.table)


m <- matrix(c(1,2,3,4,3,4,1,2,3,4),ncol=2,byrow=TRUE)

mdt <- as.data.table(m)

# .N is the number of rows in each group; grouping by all columns counts the duplicates
res <- mdt[,.N, by=names(mdt)]
res
# > res
#    V1 V2 N
# 1:  1  2 2
# 2:  3  4 3
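
Since the question asks for a matrix with a "Frequency" column, the result can be converted back. This is only a sketch under a few assumptions: the column names `c1` and `c2` are taken from the question, `freq_mat` is a name made up here, and the row names of the original matrix are not carried over because `as.data.table()` drops them by default.

```r
library(data.table)

m <- matrix(c(1,2,3,4,3,4,1,2,3,4), ncol = 2, byrow = TRUE)
colnames(m) <- c("c1", "c2")

mdt <- as.data.table(m)               # column names c1 and c2 carry over
res <- mdt[, .N, by = names(mdt)]     # one row per unique (c1, c2) pair
setnames(res, "N", "Frequency")       # rename the count column in place
freq_mat <- as.matrix(res)            # back to a plain numeric matrix
freq_mat
```

If the original row labels matter, they would have to be reattached separately, for example from `rownames(unique(m))` when `m` has row names.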

Heroka

Fastest solution of all the answers. Speed measurements added to the question. – scs Jun 20 '16 at 13:40

989 · Answer 2 · 2016-06-20T12:24:01.297

How about this using base R for extracting unique rows:

mat <- matrix(c(2,5,3,5,2,3,4,2,3,5,4,2,1,5,3,5), ncol = 2, byrow = TRUE)
mat[!duplicated(mat),]

#      [,1] [,2]
# [1,]    2    5
# [2,]    3    5
# [3,]    2    3
# [4,]    4    2
# [5,]    1    5

Extracting unique rows along with their frequencies:

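m <- as.data.frame(mat)
# Group by every column; length() gives the group size in each aggregated column, so keep only the first of them
aggregate(m, by=m, length)[1:(ncol(m)+1)]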

#   V1 V2 V1.1
# 1  4  2    2
# 2  2  3    1
# 3  1  5    1
# 4  2  5    1
# 5  3  5    3
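
If the `V1.1` header is inconvenient, a variation is to add a helper column of ones and sum it per group, which gives the count column a proper name. This is a sketch of an alternative, not the approach timed in the question; the column name `Frequency` is borrowed from the question and the names `df`, `freq` and `freq_mat` are made up here.

```r
mat <- matrix(c(2,5,3,5,2,3,4,2,3,5,4,2,1,5,3,5), ncol = 2, byrow = TRUE)
df <- as.data.frame(mat)

# Add a column of ones, then sum it within each unique (V1, V2) combination
freq <- aggregate(Frequency ~ ., data = cbind(df, Frequency = 1L), FUN = sum)

# Convert back to a matrix if a matrix is required
freq_mat <- as.matrix(freq)
freq_mat
```

The formula `Frequency ~ .` groups by all remaining columns, so the same line should work for matrices with more than two columns.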
