
I have a matrix with a large number of duplicate rows and would like to obtain a matrix of the unique rows together with a frequency count for each unique row.

The example shown below solves this problem but is painfully slow.

rowsInTbl <- function(tbl,row){
  sum(apply(tbl, 1, function(x) all(x == row) ))
}

colFrequency <- function(tblall){
  tbl <- unique(tblall)
  results <- matrix(nrow = nrow(tbl),ncol=ncol(tbl)+1)
  results[,1:ncol(tbl)] <- as.matrix(tbl)
  dimnames(results) <- list(c(rownames(tbl)),c(colnames(tbl),"Frequency"))

  freq <- apply(tbl,1,function(x)rowsInTbl(tblall,x))
  results[,"Frequency"] <- freq
  return(results)
}


m <- matrix(c(1,2,3,4,3,4,1,2,3,4),ncol=2,byrow=TRUE)
dimnames(m) <- list(letters[1:nrow(m)],c("c1","c2"))
print("Matrix")
print(m)

[1] "Matrix"
  c1 c2
a  1  2
b  3  4
c  3  4
d  1  2
e  3  4

print("Duplicate frequency table")
print(colFrequency(m))


[1] "Duplicate frequency table"
  c1 c2 Frequency
a  1  2         2
b  3  4         3

Here are timings for the answers from @Heroka and @m0h3n compared with my example, using the matrix shown above repeated 1000 times. data.table is clearly the fastest solution.

[1] "Duplicate frequency table - my example"
   user  system elapsed 
   0.372   0.000   0.371 

[1] "Duplicate frequency table - data.table"
   user  system elapsed 
   0.008   0.000   0.008 

[1] "Duplicate frequency table - aggregate"
   user  system elapsed 
   0.092   0.000   0.089 
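
A comparison like the one above could be set up roughly as follows. This is only a sketch: the names `big`, `big_df`, `big_dt`, `t_aggregate` and `t_dt` are made up for illustration, the row names are left out, and the timings will differ from machine to machine. The data.table part runs only if that package is installed.

```r
# Rebuild the example matrix and stack it 1000 times (5000 rows in total).
m <- matrix(c(1, 2, 3, 4, 3, 4, 1, 2, 3, 4), ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("c1", "c2")))
big <- m[rep(seq_len(nrow(m)), times = 1000), ]

# Time the base R aggregate approach.
big_df <- as.data.frame(big)
t_aggregate <- system.time(
  res_aggregate <- aggregate(big_df, by = big_df, length)[1:(ncol(big_df) + 1)]
)
print(t_aggregate)

# Time the data.table approach, but only if the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  big_dt <- as.data.table(big)
  t_dt <- system.time(res_dt <- big_dt[, .N, by = names(big_dt)])
  print(t_dt)
}
```

Both results should report 2000 rows of (1, 2) and 3000 rows of (3, 4).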
scs
  • In my opinion, this question differs from the question about apply functions, because here the problem is that an apply function is too slow for a large dataset and a different approach is needed – scs Jun 20 '16 at 14:01
  • Yes, sorry I pasted the wrong link: see [this one](http://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group), it is group by sum, but the idea is the same. – zx8754 Jun 20 '16 at 20:03
  • I checked the link and confirm the duplicate – scs Jun 21 '16 at 08:02

2 Answers


Looks like a job for data.table, as you need something that can aggregate quickly.

library(data.table)


m <- matrix(c(1,2,3,4,3,4,1,2,3,4),ncol=2,byrow=TRUE)

mdt <- as.data.table(m)

res <- mdt[,.N, by=names(mdt)]
res
# > res
#    V1 V2 N
# 1:  1  2 2
# 2:  3  4 3
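
If the column names from the question (`c1`, `c2`) and a count column called `Frequency` are wanted, the same idea might look like the sketch below. The names `mdt_named` and `res_named` are only illustrative, and the row names of the matrix are dropped by `as.data.table` by default.

```r
library(data.table)

# Matrix from the question, including its row and column names.
m <- matrix(c(1, 2, 3, 4, 3, 4, 1, 2, 3, 4), ncol = 2, byrow = TRUE)
dimnames(m) <- list(letters[1:nrow(m)], c("c1", "c2"))

# Convert to a data.table; the column names c1 and c2 are kept.
mdt_named <- as.data.table(m)

# Count rows per unique (c1, c2) combination and name the count column.
res_named <- mdt_named[, .(Frequency = .N), by = names(mdt_named)]
print(res_named)
```

Groups are returned in order of first appearance, so the (1, 2) row comes first here.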
Heroka

How about this using base R for extracting unique rows:

mat <- matrix(c(2,5,3,5,2,3,4,2,3,5,4,2,1,5,3,5), ncol = 2, byrow = TRUE)
mat[!duplicated(mat),]

#      [,1] [,2]
# [1,]    2    5
# [2,]    3    5
# [3,]    2    3
# [4,]    4    2
# [5,]    1    5

Extracting unique rows along with their frequencies:

m <- as.data.frame(mat)
aggregate(m, by=m, length)[1:(ncol(m)+1)]

#   V1 V2 V1.1
# 1  4  2    2
# 2  2  3    1
# 3  1  5    1
# 4  2  5    1
# 5  3  5    3
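
The count column above ends up with the auto-generated name `V1.1`. One possible variation, still in base R, is to add a helper column of ones and sum it with the formula interface of `aggregate`, which gives the count a readable name. This is a sketch of an alternative, not the code timed in the question; `df` and `freq` are illustrative names.

```r
mat <- matrix(c(2, 5, 3, 5, 2, 3, 4, 2, 3, 5, 4, 2, 1, 5, 3, 5),
              ncol = 2, byrow = TRUE)
df <- as.data.frame(mat)

# Add a column of ones, then sum it within each unique (V1, V2) combination.
freq <- aggregate(Frequency ~ ., data = cbind(df, Frequency = 1), FUN = sum)
print(freq)
```

The result has the columns `V1`, `V2` and `Frequency`, with one row per unique row of `mat`.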
989