2

Assume the following data.frame with columns of ordered factors:

dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1,1:4), X4 = c(2,2:5))
dat <- data.frame(lapply(dat0, factor, ordered=TRUE, levels=1:5, labels=letters[1:5]))

I want to create a nice looking table that compiles how many a:e are in each column of dat (including any 0 counts). The function table() is an obvious choice.

My "clean" attempt at making this table does not work. See below:

The table() function works as expected (i.e., includes all 5 factor choices -- even if one or more has a 0 count) when applied to individual columns:

table(dat[,1])

a b c d e 
1 1 1 1 1 

table(dat[,3])

a b c d e 
2 1 1 1 0 

# note: that a 0 is provided for any factor missing

However, when I try to use an apply() function on the data.frame to include all column counts into one table, I get wonky resulting formatting:

apply(dat, 2, table)

$X1

a b c d e 
1 1 1 1 1 

$X2

a b c d e 
1 1 1 1 1 

$X3

a b c d 
2 1 1 1 

$X4

b c d e 
2 1 1 1 

I can demonstrate the cause of the issue by only including columns of my data.frame that have at least 1 count for each factor that is similar between the columns. (i.e., I can get my desired formatting outcome by removing any column with a 0 count for any factor):

apply(dat[1:2], 2, table) # only including columns of dat with all 5 letters (i.e., no 0 counts)

  X1 X2
a  1  1
b  1  1
c  1  1
d  1  1
e  1  1

Question: Is there a simple workaround/solution here when using table() or am I going to have to find a different approach?

  • Note: I know I could simply cbind() the individual table results, but that's very tedious in my actual more complex data set.
theforestecologist
  • 4,667
  • 5
  • 54
  • 91
  • @ZheyuanLi yes, I see that now. Thanks for the link! I had no idea. So it seems that `lapply` is the way to go, then. How do I convert the output from lapply into a data.frame with similar format as my last example in my question? – theforestecologist Jul 15 '22 at 15:43
  • simply using `data.frame` on my resulting lapply object didn't work actually because of the structure of the `table` output. [This post](https://stackoverflow.com/questions/10590904/extracting-outputs-from-lapply-to-a-dataframe) suggests using `do.call` in this instance. Using do.call and transpose seems to work! `t(data.frame(do.call(rbind,lapply(dat, table))))` – theforestecologist Jul 15 '22 at 15:47
  • @ZheyuanLi hmm `sapply` seems to work perfectly. Oy. Always something easy -_- – theforestecologist Jul 15 '22 at 15:56

2 Answers2

1

We may use table in sapply.

sapply(dat, table)
#   X1 X2 X3 X4
# a  1  1  2  0
# b  1  1  1  2
# c  1  1  1  1
# d  1  1  1  1
# e  1  1  0  1

Or vapply which is faster, but we need to know the .

vapply(dat, table, nlevels(unlist(dat)))
#   X1 X2 X3 X4
# a  1  1  2  0
# b  1  1  1  2
# c  1  1  1  1
# d  1  1  1  1
# e  1  1  0  1

If we don't urgently need the row names, we may use tabulate.

sapply(dat, tabulate, nlevels(unlist(dat)))
#      X1 X2 X3 X4
# [1,]  1  1  2  0
# [2,]  1  1  1  2
# [3,]  1  1  1  1
# [4,]  1  1  1  1
# [5,]  1  1  0  1

In case we know the nlevels before, we may simplify it to vapply(dat, table, numeric(5L)) and sapply(dat, tabulate, numeric(5L)) which also gives a gain in speed.

Here comes the benchmark

set.seed(42)
DAT <- dat[sample(nrow(dat),1e5, replace=TRUE), ]

r <- matrix(, 5L, dim(DAT)[2])


microbenchmark::microbenchmark(
  t(data.frame(do.call(rbind,lapply(DAT, table)))),
  sapply(DAT, table),
  vapply(DAT, table, numeric(5L)),
  vapply(DAT, table, numeric(nlevels(unlist(dat)))),
  sapply(DAT, tabulate, 5L),
  sapply(DAT, tabulate, nlevels(unlist(dat))),
  `for`={for (j in seq_along(DAT)) r[, j] <- tabulate(DAT[, j], 5L)}
  )
  Unit: microseconds
                                              expr      min         lq       mean     median         uq       max neval cld
 t(data.frame(do.call(rbind, lapply(DAT, table)))) 9960.629 10101.4820 11662.6014 10221.6970 14459.0215 17422.732   100   c
                                sapply(DAT, table) 9690.340  9822.2150 11721.6487  9934.2045 14128.6330 19107.070   100   c
                   vapply(DAT, table, numeric(5L)) 9630.185  9729.9155 11313.4803  9816.3260 14017.8180 22655.129   100   c
 vapply(DAT, table, numeric(nlevels(unlist(dat)))) 9753.252  9890.5700 11309.0461  9976.4840 14110.4775 17906.082   100   c
                         sapply(DAT, tabulate, 5L)  725.613   742.7820   778.6458   785.3595   807.1935   916.700   100 a  
       sapply(DAT, tabulate, nlevels(unlist(dat)))  848.600   891.1135   936.7825   939.8245   967.2390  1114.601   100 a  
                                               for 3580.538  3846.5700  4059.3048  3922.1300  3981.4300 19752.024   100  b 

Data:

dat <- structure(list(X1 = structure(1:5, levels = c("a", "b", "c", 
"d", "e"), class = c("ordered", "factor")), X2 = structure(1:5, levels = c("a", 
"b", "c", "d", "e"), class = c("ordered", "factor")), X3 = structure(c(1L, 
1L, 2L, 3L, 4L), levels = c("a", "b", "c", "d", "e"), class = c("ordered", 
"factor")), X4 = structure(c(2L, 2L, 3L, 4L, 5L), levels = c("a", 
"b", "c", "d", "e"), class = c("ordered", "factor"))), class = "data.frame", row.names = c(NA, 
-5L))
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • 1
    @theforestecologist Ok, now see updated answer. – jay.sf Jul 15 '22 at 15:56
  • `sapply` does seem to be the way to go! As alternatives, I think the `do.call` approach from [my answer](https://stackoverflow.com/a/72996644/4581200) looks cleaner than your `vapply` approach, but I'm not sure which actually performs better. (moot point, though, since `sapply` is the clear winner here). Thanks! (+1) – theforestecologist Jul 15 '22 at 15:59
  • 1
    @theforestecologist I've included a benchmark. – jay.sf Jul 15 '22 at 16:42
1

Solution:

  1. Use lapply and not apply as explained in the ZheyuanLi's linked answer and his comment.

    • Summary: The problem of apply is that it converts everything to characters, then table re-factors those characters so that unused levels are not preserved. But lapply gives a list.
  2. Use a combination of data.frame, do.call, rbind, and t (transpose) to get the data into the desired data.frame format:

    t(data.frame(do.call(rbind,lapply(dat, table))))
    
     X1 X2 X3 X4
     a  1  1  2  0
     b  1  1  1  2
     c  1  1  1  1
     d  1  1  1  1
     e  1  1  0  1
    

Or: As ZheyuanLi pointed out, one can simply use sapply(dat, table).

Also thanks jay.sf for showing how vapply works.

theforestecologist
  • 4,667
  • 5
  • 54
  • 91