11

I use the following command:

table(factor(list, levels = 1:n))

with "list" being e.g. a = c(1,3,4,4,3) and levels = 1:5, so that the absent 2 and 5 are also taken into account. For really big datasets, my code seems to be very inefficient.
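Spelled out on the example data, the command is:

```r
# count values 1..5, keeping the levels (2 and 5) that never occur
a <- c(1, 3, 4, 4, 3)
table(factor(a, levels = 1:5))
# 1 2 3 4 5
# 1 0 2 2 0
```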

Does anyone know a hidden library or a code snippet to make it faster?

ThomasIsCoding
elmoBlue

7 Answers

14

We could use fnobs from collapse, which is efficient:

library(collapse)
fnobs(df, g = df$X1)

In base R, tabulate is more efficient than table:

tabulate(df$X1)
 [1]  9  6 15 13 11  9  7  9 11 10
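One caveat: tabulate() on its own stops at the largest observed value, so trailing empty levels are dropped; its nbins argument restores them, matching the levels = 1:n behaviour the question asks for:

```r
a <- c(1, 3, 4, 4, 3)
tabulate(a)             # [1] 1 0 2 2    -- stops at max(a)
tabulate(a, nbins = 5)  # [1] 1 0 2 2 0  -- keeps the empty level 5
```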
akrun
10

We could also use janitor::tabyl:

library(janitor)

df %>%
  tabyl(X1) %>%
  adorn_totals()

    X1   n percent
     1   9    0.09
     2   6    0.06
     3  15    0.15
     4  13    0.13
     5  11    0.11
     6   9    0.09
     7   7    0.07
     8   9    0.09
     9  11    0.11
    10  10    0.10
 Total 100    1.00
Anoushiravan R
10

TL;DR the winner is base::tabulate.

Summing up: since the main objective was performance, I prepared a microbenchmark of all the provided solutions, using a small and a bigger vector as two different scenarios. For the collapse package I had to install the newest Rcpp package (1.0.7) on my machine to prevent crashes. Even the Rcpp solution I added myself is slower than base::tabulate.

suppressMessages(library(janitor))
suppressMessages(library(collapse))
suppressMessages(library(dplyr))
suppressMessages(library(cpp11))

# source https://stackoverflow.com/questions/31001392/rcpp-version-of-tabulate-is-slower-where-is-this-from-how-to-understand
Rcpp::cppFunction('IntegerVector tabulate_rcpp(const IntegerVector& x, const unsigned max) {
    IntegerVector counts(max);
    for (auto& now : x) {
        if (now > 0 && now <= max)
            counts[now - 1]++;
    }
    return counts;
}')

set.seed(1234)

a = c(1,3,4,4,3)
levels = 1:5
df <- data.frame(X1 = a)


microbenchmark::microbenchmark(tabulate_rcpp = {tabulate_rcpp(df$X1, max(df$X1))},
                               base_table = {base::table(factor(df$X1, 1:max(df$X1)))},
                               stats_aggregate = {stats::aggregate(. ~ X1, cbind(df, n = 1), sum)},
                               graphics_hist = {hist(df$X1, plot = FALSE, right = FALSE)[c("breaks", "counts")]},
                               janitor_tably = {adorn_totals(tabyl(df, X1))},
                               collapse_fnobs = {fnobs(df, df$X1)},
                               base_tabulate = {tabulate(df$X1)},
                               dplyr_count = {count(df, X1)})
#> Unit: microseconds
#>             expr      min        lq       mean    median        uq       max
#>    tabulate_rcpp    2.959    5.9800   17.42326    7.9465    9.5435   883.561
#>       base_table   48.524   59.5490   72.42985   66.3135   78.9320   153.216
#>  stats_aggregate  829.324  891.7340 1069.86510  937.4070 1140.0345  2883.025
#>    graphics_hist  148.561  170.5305  221.05290  188.9570  228.3160   958.619
#>    janitor_tably 6005.490 6439.6870 8137.82606 7497.1985 8283.3670 53352.680
#>   collapse_fnobs   14.591   21.9790   32.63891   27.2530   32.6465   417.987
#>    base_tabulate    1.879    4.3310    5.68916    5.5990    6.6210    16.789
#>      dplyr_count 1832.648 1969.8005 2546.17131 2350.0450 2560.3585  7210.992
#>  neval
#>    100
#>    100
#>    100
#>    100
#>    100
#>    100
#>    100
#>    100


df <- data.frame(X1 = sample(1:5, 1000, replace = TRUE))

microbenchmark::microbenchmark(tabulate_rcpp = {tabulate_rcpp(df$X1, max(df$X1))},
                               base_table = {base::table(factor(df$X1, 1:max(df$X1)))},
                               stats_aggregate = {stats::aggregate(. ~ X1, cbind(df, n = 1), sum)},
                               graphics_hist = {hist(df$X1, plot = FALSE, right = FALSE)[c("breaks", "counts")]},
                               janitor_tably = {adorn_totals(tabyl(df, X1))},
                               collapse_fnobs = {fnobs(df, df$X1)},
                               base_tabulate = {tabulate(df$X1)},
                               dplyr_count = {count(df, X1)})
#> Unit: microseconds
#>             expr      min        lq       mean    median        uq       max
#>    tabulate_rcpp    4.847    8.8465   10.92661   10.3105   12.6785    28.407
#>       base_table   83.736  107.2040  121.77962  118.8450  129.9560   184.427
#>  stats_aggregate 1027.918 1155.9205 1338.27752 1246.6205 1434.8990  2085.821
#>    graphics_hist  209.273  237.8265  274.60654  258.9260  300.3830   523.803
#>    janitor_tably 5988.085 6497.9675 7833.34321 7593.3445 8422.6950 13759.142
#>   collapse_fnobs   26.085   38.6440   51.89459   47.8250   57.3440   333.034
#>    base_tabulate    4.501    6.7360    8.09408    8.2330    9.2170    11.463
#>      dplyr_count 1852.290 2000.5225 2374.28205 2145.9835 2516.7940  4834.544
#>  neval
#>    100
#>    100
#>    100
#>    100
#>    100
#>    100
#>    100
#>    100

Created on 2021-08-01 by the reprex package (v2.0.0)

polkas
    Great benchmarking! Upvoted for your excellent work! – ThomasIsCoding Aug 01 '21 at 13:34
  • That is well done. You _could_ tweak your clean and simple Rcpp function two ways we sometimes do: "loop unrolling" (jumping multiple elements at once), and/or via OpenMP. Also, there is already an Rcpp 'sugar' function `table()`. As for "to prevent crashing": you should generally update all packages; with Rcpp 1.0.7 we do indeed need it present if you run packages compiled against it. Lastly, one final problem you have here: your data set is unrealistically small. I switched to `a <- sample(1000, 10000, TRUE)` and now your `tabulate` and base R `table` are both the fastest. – Dirk Eddelbuettel Aug 01 '21 at 14:55
  • Oh, and you are not using `levels` and not testing a `factor` so the setup is slightly different. – Dirk Eddelbuettel Aug 01 '21 at 14:58
  • One more: I gave OpenMP a shot; it does not help as the results vector is shared. – Dirk Eddelbuettel Aug 01 '21 at 15:22
  • Thanks indeed. I actually never thought `janitor` would be this slow! – Anoushiravan R Aug 13 '21 at 06:23
8

It's not exactly what you are looking for, but perhaps you can use this:

library(dplyr)
set.seed(8192)

df <- data.frame(X1 = sample(1:10, 100, replace = TRUE))

df %>% 
  count(X1)

returns

   X1  n
1   1  9
2   2  6
3   3 15
4   4 13
5   5 11
6   6  9
7   7  7
8   8  9
9   9 11
10 10 10

If you need to count more numbers (including missing ones), you could use

library(tidyr)
library(dplyr)

df2 <- data.frame(X1 = 1:12)

df %>% 
  count(X1) %>% 
  right_join(df2, by="X1") %>% 
  mutate(n = replace_na(n, 0L))

to get

   X1  n
1   1  9
2   2  6
3   3 15
4   4 13
5   5 11
6   6  9
7   7  7
8   8  9
9   9 11
10 10 10
11 11  0
12 12  0
Martin Gal
6

A base R option using aggregate (borrowing df from @Martin Gal)

> aggregate(. ~ X1, cbind(df, n = 1), sum)
   X1  n
1   1  9
2   2  6
3   3 15
4   4 13
5   5 11
6   6  9
7   7  7
8   8  9
9   9 11
10 10 10

Another option is using hist

> hist(df$X1, plot = FALSE, right = FALSE)[c("breaks", "counts")]
$breaks
 [1]  1  2  3  4  5  6  7  8  9 10

$counts
[1]  9  6 15 13 11  9  7  9 21
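Note that with these breaks the last bin [9, 10] is closed on both sides, so the counts for 9 and 10 are merged (11 + 10 = 21). Supplying breaks one below each value keeps every count separate; on the small example vector from the question:

```r
a <- c(1, 3, 4, 4, 3)
# with breaks 0:5 each value k falls in its own bin (k-1, k]
hist(a, breaks = 0:5, plot = FALSE)$counts
# [1] 1 0 2 2 0
```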
ThomasIsCoding
6

Here is one more: summarytools

Data from Martin Gal (many thanks):

library(summarytools)

set.seed(8192)
df <- data.frame(X1 = sample(1:10, 100, replace = TRUE))

summarytools::freq(df$X1, cumul=FALSE)

Output:

              Freq   % Valid   % Total
----------- ------ --------- ---------
          1      9      9.00      9.00
          2      6      6.00      6.00
          3     15     15.00     15.00
          4     13     13.00     13.00
          5     11     11.00     11.00
          6      9      9.00      9.00
          7      7      7.00      7.00
          8      9      9.00      9.00
          9     11     11.00     11.00
         10     10     10.00     10.00
       <NA>      0                0.00
      Total    100    100.00    100.00
TarJae
2

If a faster alternative to table() is required, including for cross-tabulation, collapse::qtab(), available since v1.8.0 (May 2022), is a faithful and noticeably faster replacement. fcount() can also be used in the univariate case, and returns a data.frame.

library(collapse) # > v1.8.0, and > 1.9.0 for fcount()
library(microbenchmark)
v = sample(10000, 1e6, TRUE)

microbenchmark(qtab(v, sort = FALSE), fcount(v), tabulate(v), times = 10)
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 qtab(v, sort = FALSE) 1.911707 1.945245 2.002473 1.963654 2.027942 2.207891    10
             fcount(v) 1.885549 1.906746 1.978894 1.932310 2.103997 2.138027    10
           tabulate(v) 2.321543 2.323716 2.333839 2.328206 2.334499 2.372506    10

v2 = sample(10000, 1e6, TRUE)
microbenchmark(qtab(v, v2), qtab(v, v2, sort = FALSE), table(v, v2), times = 10)
Unit: milliseconds
                      expr       min        lq      mean   median        uq      max neval
               qtab(v, v2)  45.61279  51.14840  74.16168  60.7761  72.86385 157.6501    10
 qtab(v, v2, sort = FALSE)  41.30812  49.66355  57.02565  51.3568  54.69859 118.1289    10
              table(v, v2) 281.60079 282.85273 292.48119 286.0535 288.19253 349.5513    10

That being said, tabulate() is pretty much as fast as it gets as far as C code is concerned. But it has a clear caveat: it does not hash the values at all; it determines the maximum value and allocates a result vector of that length, using it as a counting table. Consider this:

v[10] = 1e7L # Adding a random large value here
length(tabulate(v))
[1] 10000000
length(table(v))
[1] 10001
length(qtab(v))
[1] 10001

So you get a result vector with nearly 10 million zeros, and your performance deteriorates:

microbenchmark(qtab(v, sort = FALSE), fcount(v), tabulate(v), times = 10)
Unit: milliseconds
                  expr      min       lq     mean   median       uq       max neval
 qtab(v, sort = FALSE) 1.873249 1.900473 1.966721 1.923064 2.064186  2.126588    10
             fcount(v) 1.829338 1.850330 1.926676 1.880199 2.021013  2.057667    10
           tabulate(v) 4.207789 4.357439 5.066296 4.417012 4.558216 10.347744    10

In light of this, the fact that qtab() actually does hash every value and achieves this performance is rather remarkable.

Sebastian