If `data` has modest dimensions, then you can do:
```r
f1 <- function(data, levels) {
    ## flatten, convert to factor, count, then drop the "table" class
    c(table(factor(unlist(data, FALSE, FALSE), levels)))
}
f1(data, c("0/1", "1/1", "0/0"))
```
If not, then you may need a different function, because `f1` requires you to allocate memory for five `prod(dim(data))`-length vectors: the `unlist` result, the `factor` result, and three intermediate objects inside of `table` ([1], [2], [3]).
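As a quick illustration of the first of those allocations, using a made-up toy data frame `d` (not the `data` used below): `unlist` flattens every column into one atomic vector before any counting happens.

```r
d <- data.frame(a = c("0/1", "0/0"), b = c("1/1", "0/1"),
                stringsAsFactors = FALSE)
length(unlist(d, FALSE, FALSE))  # 4, i.e., prod(dim(d))
```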
`f2` below is more verbose but much more efficient:
- It computes column-wise counts, then sums them to obtain the result. In this way, it avoids creating vectors of length greater than `nrow(data)`.
- It uses `tabulate` instead of `table` to do the counting. You can think of `tabulate` as a low-level analogue of `table`. With some care, you can use `tabulate` to obtain the `table` result without any of the associated overhead; see the short comparison after this list.
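To make the correspondence concrete, here is a small sketch (the vector `x` is just a made-up example): `table` ultimately counts the integer codes of a factor with `tabulate`, then attaches names and the `"table"` class.

```r
x <- c("0/1", "0/0", "0/1", "1/1")
lv <- c("0/1", "1/1", "0/0")
table(factor(x, lv))
## 0/1 1/1 0/0 
##   2   1   1 
tabulate(factor(x, lv), length(lv))  # the same counts, as a bare integer vector
## [1] 2 1 1
```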
```r
f2 <- function(data, levels) {
    nl <- length(levels)
    ## per-column counts of 'levels', as a bare integer vector
    tt <- function(x, levels) tabulate(factor(x, levels), nl)
    ## 'nl'-by-'length(data)' matrix of counts, one column per variable
    cc <- vapply(data, tt, integer(nl), levels, USE.NAMES = FALSE)
    ## sum across columns; delete 'as.integer' if worried about integer overflow
    res <- as.integer(.rowSums(cc, nl, length(data)))
    names(res) <- levels
    res
}
f2(data, c("0/1", "1/1", "0/0"))
```
Here is a test using a data frame with 1 million rows and 100 variables:
```r
s <- c("0/1", "1/1", "0/0")
set.seed(1L)
data <- as.data.frame(replicate(100L, sample(s, 1e+06L, TRUE), simplify = FALSE))
f1(data, s)
##      0/1      1/1      0/0 
## 33329488 33332464 33338048 
f2(data, s)
##      0/1      1/1      0/0 
## 33329488 33332464 33338048 
```
```r
microbenchmark::microbenchmark(f1(data, s), f2(data, s))
## Unit: seconds
##         expr      min       lq     mean   median       uq      max neval
##  f1(data, s) 2.883588 2.956380 3.172275 3.114462 3.342997 3.724857   100
##  f2(data, s) 1.170202 1.185615 1.203229 1.194077 1.207591 1.328175   100
```
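If you also want to verify the allocation claim above, here is a minimal sketch, assuming the `bench` package is installed: `bench::mark` records memory allocations via `Rprofmem` and also checks that the two results agree.

```r
bench::mark(f1(data, s), f2(data, s))[c("expression", "mem_alloc")]
```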