
Suppose I have two equal-length logical vectors. Computing the confusion matrix the easy way:

c(sum(actual == 1 & predicted == 1),
  sum(actual == 0 & predicted == 1),
  sum(actual == 1 & predicted == 0),
  sum(actual == 0 & predicted == 0))

requires scanning the vectors four times.

Is it possible to do that in a single pass?

PS. I tried `table(2*actual + predicted)` and `table(actual, predicted)`, but both are obviously much slower.

PPS. Speed is not my main consideration here; I am more interested in understanding the language.

sds
    Perhaps you could try `data.table`, i.e. `DT <- data.table(actual, predicted); setkey(DT, actual, predicted)[,.N, .(actual, predicted)]$N` – akrun Jan 14 '15 at 17:32
  • this is indeed faster! I guess data.table uses radix sort? – sds Jan 14 '15 at 17:37
  • Maybe this link helps: http://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table – akrun Jan 14 '15 at 17:40
  • @akrun: please turn your comment into an answer – sds Jan 14 '15 at 17:42
  • Not as fast as `data.table` might be, but not too shabby either: `data_frame(actual, predicted) %>% group_by(actual, predicted) %>% summarise(n())` – Khashaa Jan 14 '15 at 17:51
  • @Khashaa: `Error: could not find function "%>%"` – sds Jan 14 '15 at 17:53
  • You need `dplyr` to run this. `install.packages("dplyr"); library(dplyr)` – Khashaa Jan 14 '15 at 17:55
  • You can also use dplyr's `count` function if you want to know the group sizes: `data_frame(actual, predicted) %>% count(actual, predicted)`. – talat Jan 14 '15 at 18:12

3 Answers


You could try using `data.table`:

library(data.table)
DT <- data.table(actual, predicted)
setkey(DT, actual, predicted)[,.N, .(actual, predicted)]$N

data

set.seed(24)
actual <- sample(0:1, 10 , replace=TRUE)
predicted <- sample(0:1, 10, replace=TRUE)
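
For reference, keeping the grouping columns instead of extracting just `$N` makes the ordering explicit. A small sketch using the sample data above; with `setkey`, the counts come back in ascending key order, and note that combinations that never occur in the data are dropped rather than reported as zero:

DT <- data.table(actual, predicted)
# grouped counts come back in key order: (0,0), (0,1), (1,0), (1,1);
# pairs absent from the data are dropped, not reported as 0
setkey(DT, actual, predicted)[, .N, .(actual, predicted)]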

Benchmarks

Using data.table_1.9.5 and dplyr_0.4.0

library(microbenchmark)
library(dplyr)  # for data_frame/%>% used in f3
set.seed(245)
actual <- sample(0:1, 1e6 , replace=TRUE)
predicted <- sample(0:1, 1e6, replace=TRUE)
f1 <- function(){
  DT <- data.table(actual, predicted)
  setkey(DT, actual, predicted)[,.N, .(actual, predicted)]$N}

f2 <- function(){table(actual, predicted)}
f3 <- function() {data_frame(actual, predicted) %>%
                      group_by(actual, predicted) %>% 
                      summarise(n())}

microbenchmark(f1(), f2(), f3(), unit='relative', times=20L)
#Unit: relative
# expr       min        lq      mean   median        uq       max neval cld
#f1()  1.000000  1.000000  1.000000  1.00000  1.000000  1.000000    20  a 
#f2() 20.818410 22.378995 22.321816 22.56931 22.140855 22.984667    20   b
#f3()  1.262047  1.248396  1.436559  1.21237  1.220109  2.504662    20  a 

Including `count` from dplyr and `tabulate` in the benchmarks, on a slightly bigger dataset:

set.seed(498)
actual <- sample(0:1, 1e7 , replace=TRUE)
predicted <- sample(0:1, 1e7, replace=TRUE)
f4 <- function() {data_frame(actual, predicted) %>% 
                       count(actual, predicted)}
f5 <- function(){tabulate(4-actual-2*predicted, 4)}

Update

Including another `data.table` solution (provided by @Arun) in the benchmarks. `setDT(list(actual, predicted))` converts the list to a `data.table` by reference, avoiding the copy that `data.table()` makes, and `keyby` sorts while grouping:

f6 <- function() {setDT(list(actual, predicted))[,.N, keyby=.(V1,V2)]$N}

microbenchmark(f1(),  f3(), f4(), f5(), f6(),  unit='relative', times=20L)
#Unit: relative
#expr      min       lq     mean   median       uq      max neval  cld
#f1() 2.003088 1.974501 2.020091 2.015193 2.080961 1.924808    20   c 
#f3() 2.488526 2.486019 2.450749 2.464082 2.481432 2.141309    20    d
#f4() 2.388386 2.423604 2.430581 2.459973 2.531792 2.191576    20    d
#f5() 1.034442 1.125585 1.192534 1.217337 1.239453 1.294920    20  b  
#f6() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20 a   
akrun

Like this:

tabulate(4 - actual - 2*predicted, 4)

(`tabulate` here is much faster than `table` because it knows the output will be a vector of length 4.)
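
A quick sanity check (a sketch with made-up sample data): the index `4 - actual - 2*predicted` maps the pairs (1,1), (0,1), (1,0), (0,0) to 1, 2, 3, 4 respectively, so `tabulate` returns the counts in exactly the order of the question's four `sum` calls:

set.seed(1)
actual    <- sample(0:1, 100, replace = TRUE)
predicted <- sample(0:1, 100, replace = TRUE)
one_pass  <- tabulate(4 - actual - 2*predicted, 4)
four_pass <- c(sum(actual == 1 & predicted == 1),
               sum(actual == 0 & predicted == 1),
               sum(actual == 1 & predicted == 0),
               sum(actual == 0 & predicted == 0))
identical(one_pass, four_pass)  # should be TRUE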

flodel

There is `table`, which computes a cross-tabulation and should give similar results if `actual` and `predicted` contain only zeros and ones:

table(actual, predicted)

Internally, this works by `paste`ing the vectors together, which is horribly inefficient. The coercion to character seems to happen even when tabulating a single vector, which is probably why `table(actual*2 + predicted)` is also slow.
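
If you want the single-pass counts from `tabulate` back in the same 2x2 layout that `table(actual, predicted)` produces, a small sketch (reusing flodel's encoding from the answer above) could be:

counts <- tabulate(4 - actual - 2*predicted, 4)
# counts come back as (1,1), (0,1), (1,0), (0,0); reversing them fills
# the matrix column-wise as (0,0), (1,0), (0,1), (1,1)
matrix(rev(counts), 2, 2,
       dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))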

krlmlr