
I need to find a very efficient way to calculate the number of times each of 3 specific values occurs in a data frame. Here is what my data frame looks like:

[screenshot of the data frame]

The only values that can be found there are "0/1", "1/1", and "0/0". I want the output to be in the form of 3 different variables containing the respective number of occurrences.

Edit:

As mentioned in the comments, I tried to use table(unlist(DF)), but I reckon my data frame is too big.



2 Answers


If data has modest dimensions, then you can do:

f1 <- function(data, levels) {
  c(table(factor(unlist(data, FALSE, FALSE), levels)))
}
f1(data, c("0/1", "1/1", "0/0"))

If not, then you may need a different function, because f1 requires you to allocate memory for five prod(dim(data))-length vectors: the unlist result, the factor result, and three intermediate objects inside of table ([1], [2], [3]).
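For a rough sense of the scale, here is a standalone illustration (not part of the original answer; toy is a made-up 100000-by-10 frame). unlist flattens every cell into a single vector, so its size grows with prod(dim(data)), whereas any one column only has nrow(data) elements:

s <- c("0/1", "1/1", "0/0")
toy <- as.data.frame(replicate(10L, sample(s, 1e+05L, TRUE), simplify = FALSE))
object.size(toy[[1L]])                  # one column: nrow(toy) elements
object.size(unlist(toy, FALSE, FALSE))  # flattened frame: prod(dim(toy)) elements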

f2 below is more verbose but much more efficient:

  • It computes column-wise counts then takes their sum to obtain the result. In this way, it avoids creating vectors of length greater than nrow(data).
  • It uses tabulate instead of table to do the counting. You can think of tabulate as a low level analogue of table. With some care, you can use tabulate to obtain the table result without any of the associated overhead; the short check after this list makes the correspondence concrete.
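A minimal standalone check of that correspondence (the vector x here is illustrative, not part of f2):

s <- c("0/1", "1/1", "0/0")
x <- sample(s, 10L, TRUE)
table(factor(x, s))                # named counts
tabulate(factor(x, s), length(s))  # the same counts, as an unnamed integer vector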
f2 <- function(data, levels) {
  nl <- length(levels)
  ## count the occurrences of 'levels' in a single column
  tt <- function(x, levels) tabulate(factor(x, levels), nl)
  ## column-wise counts as an 'nl'-by-'length(data)' integer matrix
  cc <- vapply(data, tt, integer(nl), levels, USE.NAMES = FALSE)
  ## sum across columns; delete 'as.integer' if worried about integer overflow
  res <- as.integer(.rowSums(cc, nl, length(data)))
  names(res) <- levels
  res
}
f2(data, c("0/1", "1/1", "0/0"))

Here is a test using a data frame with 1 million rows and 100 variables:

s <- c("0/1", "1/1", "0/0")

set.seed(1L)
data <- as.data.frame(replicate(100L, sample(s, 1e+06L, TRUE), simplify = FALSE))

f1(data, s)
##      0/1      1/1      0/0 
## 33329488 33332464 33338048

f2(data, s)
##      0/1      1/1      0/0 
## 33329488 33332464 33338048

microbenchmark::microbenchmark(f1(data, s), f2(data, s))
## Unit: seconds
##         expr      min       lq     mean   median       uq      max neval
##  f1(data, s) 2.883588 2.956380 3.172275 3.114462 3.342997 3.724857   100
##  f2(data, s) 1.170202 1.185615 1.203229 1.194077 1.207591 1.328175   100

As suggested by RobZ in the comments, table(unlist(df, use.names = F)) is quite speedy, but working on a plain vector is faster still. As this SO post shows, and as inspecting table confirms, table calls tabulate under the hood. We can dig into that a little, provided we are sure no edge cases exist.

bench::mark(
  one = table(unlist(df, use.names = F)),
  two = .Internal(tabulate(
    unlist(df, use.names = F) |> # make a vector
      (\(.) ifelse(. == "0/0", 1L, ifelse(. == "0/1", 2L, 3L)))(), # convert to integer codes
    3)),
  check = F
)

# A tibble: 2 x 13
  expression      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 one          62.8us    67us    13689.        0B     6.26  6555     3      479ms
2 two          11.1us  12.3us    76757.        0B     7.68  9999     1      130ms

Since we defined the "factor levels" ourselves, we can infer which count belongs to which value. Note that the output is not identical to table's; that is the cost of speed versus convenience. Another thing to consider is premature optimization: the code in this post is mainly for educational purposes. Base R functions are usually well optimized for the general use case, and minor tweaks probably take longer to implement than they save, on top of the risk of errors if edge cases are actually present.
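If the named, table-like output is wanted anyway, the labels can be re-attached after counting. A minimal sketch, assuming the same df and the 1 = "0/0", 2 = "0/1", 3 = "1/1" coding used above:

x <- unlist(df, use.names = FALSE)
codes <- ifelse(x == "0/0", 1L, ifelse(x == "0/1", 2L, 3L))
counts <- tabulate(codes, nbins = 3L)     # unnamed integer counts
setNames(counts, c("0/0", "0/1", "1/1"))  # named, table()-like output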

  • Your nested `ifelse` is an anti-optimization. `factor` is much faster in most nontrivial applications, especially when you know the levels. Note also that base R has a `tabulate` function that calls `.Internal` earlier than `table`. Try comparing (1) `table(x)`, (2) `tabulate(factor(x, levels = s), 3L)`, (3) `.Internal(tabulate(factor(x, levels = s), 3L))`, and (4) `.Internal(tabulate(ifelse(x == "0/1", 1L, ifelse(x == "1/1", 2L, 3L)), 3L))`, where `s <- c("0/1", "1/1", "0/0")` and `x <- sample(s, 1e+06, TRUE)`. You should find that (2) ~ (3) < (1) < (4). – Mikael Jagan Jan 28 '22 at 18:32
  • I see no reason not to use `table` here. The bottleneck in the linked thread is not due to `table` but due to `factor()`, which requires a call to `unique` and a coercion to character. `factor(, levels)` is fast. – Mikael Jagan Jan 28 '22 at 18:40
  • 1
    @MikaelJagan Valid observations, `table` suffices (as mentioned in the post and comments). I think input should not be immutable, especially in a question aimed at efficiency. As for integer vs factor, `tabulate` etc, consider `set.seed(1); y = factor(sample(c("0/0", "0/1", "1/1"), 1e6, T)); z = ifelse(y == "0/0", 1L, ifelse(y == "0/1", 2L, 3L))`, then compare `bench::mark(table(y), .Internal(tabulate(z, 3)), check = F)`, is that an unfair benchmark? – Donald Seinen Jan 28 '22 at 19:18
  • 1
    I see now - I wrote my last comment not realizing that the OP was running out of memory. I was thinking only about speed. Your example is a good one. For factor `x`, `tabulate(x, nlevels(x))` is both faster and more efficient than `table(x)`. If `length(x) <= 1e+08` (say), then the difference may not matter much in an interactive session. If `x` is longer, then `tabulate` is worth thinking about (and may be the only option, depending on memory constraints). – Mikael Jagan Jan 29 '22 at 04:02
  • 1
    Since `length(x)` matters, it would help, maybe, to show in your answer how you created `df`. – Mikael Jagan Jan 29 '22 at 04:13