6

When attempting to get a "count" of observations that meet a condition, each of these expressions appears to work, since they all return the same result. But how do they operate differently behind the scenes, and in what scenarios would it not be appropriate to swap one for another?

sum(grade.data$Quiz >= (100*.45))

length(which(grade.data$Quiz >= (100*.45)))

nrow(grade.data[grade.data$Quiz >= (100*.45),])
SocraticDatum
  • 349
  • 2
  • 4
  • 15
  • Yet another idea: `table(grade.data$Quiz >= (100*.45))[["TRUE"]]` – alexis_laz Jul 31 '14 at 17:56
  • 2
    Just use the first one. Of course you can get the same answer many ways, but the second and third are just convoluted answers to a simple question. The second performs the logical test, asks which indices are TRUE, and then counts them. The third does the logical test, creates a new data frame, and then counts its rows. The first sums a vector of 0s and 1s. Why make it more complicated than it needs to be?... as for the whole benchmarking below... eh, really? – rawr Jul 31 '14 at 19:07

2 Answers

6

The middle one (length(which(...))) will not give misleading answers when there are missing values. Both of the others will.

Number 1 sums a logical vector that is coerced to 1s and 0s. If you added na.rm = TRUE it would be valid when NAs are present.

Number 2 determines the length of the integer vector of indices that which() returns; since which() drops NAs, the count stays valid when missing values are present.

Number 3 constructs a subset of the data frame and then counts the rows. I would expect it to be rather inefficient compared to the other two, and it shares the problem with NA values. If you added & !is.na(grade.data$Quiz) to the logical expression inside [ , ], you would get valid answers.

A fourth method, similar to the third (and also inefficient) but without the NA problem, would be:

nrow( subset( grade.data, Quiz >= (100*.45) ) )
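
To make the NA behaviour concrete, here is a minimal sketch on a small made-up data frame (the values are invented purely for illustration; the threshold 45 is just 100*.45 from the question):

d <- data.frame(Quiz = c(90, 30, NA, 60), age = c(19, 20, 21, 22))

sum(d$Quiz >= 45)               # NA -- the NA propagates through sum()
sum(d$Quiz >= 45, na.rm = TRUE) # 2  -- valid once na.rm = TRUE is added
length(which(d$Quiz >= 45))     # 2  -- which() silently drops the NA
nrow(d[d$Quiz >= 45, ])         # 3  -- [ , ] subsetting keeps an all-NA row
nrow(subset(d, Quiz >= 45))     # 2  -- subset() drops the NA row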
IRTFM
  • 258,963
  • 21
  • 364
  • 487
2

Let's generate a 100k-row data.frame to see which method is fastest.

grade.data = data.frame(Quiz = sample(100000), age = sample(18:24, 100000, replace = TRUE))
library(data.table)
dt.grade.data = as.data.table(grade.data)

The methods posted here

data.table = function(x) dt.grade.data[,sum(Quiz>=100*.45)]
logical.sum = function(x) sum(grade.data$Quiz >= (100*.45))
logical.counting.table = function(x) table(grade.data$Quiz >= (100*.45))[["TRUE"]]
logical.which = function(x) length(which(grade.data$Quiz >= (100*.45)))
subsetting = function(x) nrow(grade.data[grade.data$Quiz >= (100*.45),])
subset.cmd = function(x) nrow(subset(grade.data, Quiz >= (100*.45) ))

Benchmark

library(microbenchmark)
microbenchmark(data.table(), logical.sum(), logical.counting.table(), logical.which(), subsetting(), subset.cmd(), times = 100L)

Unit: microseconds
                     expr       min         lq    median        uq       max neval
             data.table()  1766.148  2188.8000  2308.267  2469.405  29185.36   100
            logical.sum()   739.385   945.4765   993.921  1074.386  10253.67   100
 logical.counting.table() 28867.605 30847.0290 31546.796 32725.255  65514.14   100
          logical.which()   701.205  1080.9555  1138.635  1228.545   3565.96   100
             subsetting() 27376.931 28406.7730 29243.866 30564.371 168034.45   100
             subset.cmd() 29004.315 31203.1730 32219.878 33362.003  89801.34   100

It seems that a vectorized logical check (logical.sum()) is the fastest method. In a smaller data frame (500 rows), data.table is actually much slower than all the other methods.
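
The smaller-case claim is easy to check by rebuilding grade.data and dt.grade.data at 500 rows and re-running the same benchmark (a sketch reusing the function definitions above, which look up those objects globally):

grade.data = data.frame(Quiz = sample(500), age = sample(18:24, 500, replace = TRUE))
dt.grade.data = as.data.table(grade.data)
microbenchmark(data.table(), logical.sum(), logical.which(), times = 100L)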

edit: Apparently, the relative efficiency of logical.sum() and logical.which() depends on the data. Using a different Quiz score distribution can make logical.sum() the fastest method. And, as expected, data.table selection/subsetting blows data.frame subsetting out of the water.
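
For completeness, the subset-then-count pattern can also be written natively in data.table using .N (the row count of the selected subset); this variant was not part of the benchmark above, so treat it as a sketch rather than a measured result:

dt.grade.data[Quiz >= (100*.45), .N]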

Vlo
  • 3,168
  • 13
  • 27