If your matrix is large, then it makes sense to compute counts rowwise or columnwise to conserve memory, and apply
is a valid way to go about this.
Conceptually, this answer is not unlike the one I provided here for data frames. I will once again recommend tabulate
over table
; it is much more efficient.
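To see why, compare the two on the same input: table() builds a named, classed object, while tabulate() on the integer codes of a factor returns a bare integer vector with none of that overhead. A minimal sketch:

```r
x <- c("2", "0", "0", "1", "0")
levels <- c("0", "1", "2")

# table() returns a named, classed object:
table(x)
## x
## 0 1 2 
## 3 1 1 

# tabulate() counts the factor's integer codes directly,
# one bin per level, and returns a plain integer vector:
tabulate(factor(x, levels), nbins = length(levels))
## [1] 3 1 1
```

The nbins argument matters: it guarantees a count (possibly zero) for every level, even levels absent from x.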
set.seed(1L)
m <- 5L
n <- 4L
A <- matrix(sample(c("0", "1", "2"), size = m * n, replace = TRUE), m, n)
A
##      [,1] [,2] [,3] [,4]
## [1,] "0"  "2"  "2"  "1"
## [2,] "2"  "2"  "0"  "1"
## [3,] "0"  "1"  "0"  "1"
## [4,] "1"  "1"  "0"  "2"
## [5,] "0"  "2"  "1"  "0"
f <- function(x, levels) tabulate(factor(x, levels), length(levels))
rowSums(apply(A, 1L, f, c("0", "1", "2"))) # if 'A' has more columns than rows
## [1] 7 7 6
rowSums(apply(A, 2L, f, c("0", "1", "2"))) # if 'A' has more rows than columns
## [1] 7 7 6
You are going to want apply
to loop over the smaller dimension of your matrix, so choose the second (MARGIN) argument accordingly. If your matrix actually has millions of rows and only 18 columns, then use the second statement above, not the first.
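Incidentally, if what you need are per-column (or per-row) counts rather than grand totals, just drop the rowSums() wrapper: apply() already returns one column of counts per margin element. A quick sketch, reusing f on a hard-coded copy of the small A from above so it runs on its own:

```r
f <- function(x, levels) tabulate(factor(x, levels), length(levels))

# Same 5 x 4 matrix as above, written out literally (column-major):
A <- matrix(c("0", "2", "0", "1", "0",
              "2", "2", "1", "1", "2",
              "2", "0", "0", "0", "1",
              "1", "1", "1", "2", "0"), 5L, 4L)

# One column of counts per column of A; row i holds the count of level i:
counts <- apply(A, 2L, f, c("0", "1", "2"))
rownames(counts) <- c("0", "1", "2")
counts
##   [,1] [,2] [,3] [,4]
## 0    3    0    3    1
## 1    1    2    1    3
## 2    1    3    1    1
```

Summing across that matrix with rowSums() recovers the grand totals 7 7 6 shown above.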
Here is a test using a matrix with your dimensions. It only takes ~10 seconds on my machine, so parallelization might be overkill.
set.seed(1L)
m <- 3e+07L
n <- 18L
A <- matrix(sample(c("0", "1", "2"), m * n, replace = TRUE), m, n)
system.time(rowSums(apply(A, 2L, f, c("0", "1", "2"))))
## user system elapsed
## 8.195 2.816 12.322
Just for fun:
library("parallel")
system.time(Reduce(`+`, mclapply(seq_len(n), function(i) f(A[, i], c("0", "1", "2")), mc.cores = 4L)))
## user system elapsed
## 3.924 0.904 3.497
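One last remark: the whole point of looping with apply() here is to keep memory use down, since each call only builds factor codes for one row or column at a time. If your matrix fits comfortably in memory, the grand totals can also be had in a single pass, because factor() silently flattens a matrix to a vector; the cost is an integer vector of codes the size of the full matrix. A sketch on the small A from above (hard-coded so it runs on its own):

```r
f <- function(x, levels) tabulate(factor(x, levels), length(levels))

# Same 5 x 4 matrix as above, written out literally (column-major):
A <- matrix(c("0", "2", "0", "1", "0",
              "2", "2", "1", "1", "2",
              "2", "0", "0", "0", "1",
              "1", "1", "1", "2", "0"), 5L, 4L)

# factor() coerces the matrix to a vector, so f() tabulates all of A at once:
f(A, c("0", "1", "2"))
## [1] 7 7 6
```

I have not benchmarked this variant on the 3e7-row case; for totals only, it trades the loop for one large allocation.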