
My goal is to sum, for each row, the values of all columns that start with the prefix skill_ in a data.table. I would prefer a solution using data.table, but I am not picky.

My solution up to now:

> require(data.table)
> DT <- data.table(x=1:4, skill_a=c(0,1,0,0), skill_b=c(0,1,1,0), skill_c=c(0,1,1,1))
> DT[, row_idx := 1:nrow(DT)]
> DT[, count_skills := 
          sapply(1:nrow(DT), 
                 function(id) sum(DT[row_idx == id, 
                                     grepl("skill_", names(DT)), with=FALSE]))]

> DT
   x skill_a skill_b skill_c row_idx count_skills
1: 1       0       0       0       1            0
2: 2       1       1       1       2            3
3: 3       0       1       1       3            2
4: 4       0       0       1       4            1

But this becomes very slow when DT is very large. Is there a more efficient way to do this?

Rodrigo

4 Answers


A question about efficiency and performance always deserves benchmarks...

The size of your data matters: the methods grow at different rates, so which approach is fastest changes with the size of the table...

[Figure: relative benchmark timings for sizes between 2^4 and 2^24, with sizes along floor( 2^logb(10^( seq( 4, 24, .5 ) ), 10 ) ).]

Excerpt of the benchmarks at 1 million rows...

## Unit: milliseconds
##             expr    min     lq median    uq   max neval
##    dplyr.sol(DT) 21.803 50.260 51.765 52.45 73.30   100
##  rowSums.sol(DT) 20.759 50.224 51.418 52.56 96.28   100
##   SDCols.sol(DT)  7.250  8.916 37.699 38.50 52.69   100
##     eval.sol(DT)  6.883  7.007  7.916  9.45 50.91   100

eval.sol is an answer that takes advantage of data.table's handling of expressions: it builds the sum expression once (for this data, skill_a+skill_b+skill_c) and evaluates it within the table. Full source below...

library(compiler)
library(data.table)
suppressMessages(library(dplyr))
library(microbenchmark)

# Replicate the OP's 4-row example table `reps` times.
buildDT <- function(reps) {
  data.table(x=seq_len(reps*4),
             skill_a=rep(c(0,1,0,0),reps),
             skill_b=rep(c(0,1,1,0),reps),
             skill_c=rep(c(0,1,1,1),reps))
}

# The question's approach: one data.table subset per row, so it scales badly.
OP.sol <- function(DT) {
  DT[, row_idx := 1:nrow(DT)]
  DT[, count_skills :=
          sapply(1:nrow(DT),
                 function(id) sum(DT[row_idx == id,
                                     grepl("skill_", names(DT)), with=FALSE]))]
}

# Note: %.% was dplyr's original chaining operator; later versions use %>%.
dplyr.sol <- function(DT)
  DT %.% select(starts_with("skill_")) %.% rowSums()

# Add the skill_ columns pairwise via Reduce over .SD.
SDCols.sol <- function(DT)
  DT[, Reduce(`+`, .SD),
     .SDcols = grep("skill_", names(DT), value = T)]

# Base R rowSums over the matching columns.
rowSums.sol <- function(DT)
  rowSums(DT[, grep("skill_", names(DT)), with=FALSE])

# Build the expression skill_a+skill_b+... once, then evaluate it within DT.
eval.sol <- function(DT) {
  cmd <- parse(text=paste(colnames(DT)[grepl("^skill_", colnames(DT))], collapse='+'))
  DT[, eval(cmd)]
}

DT <- buildDT(1)
identical(OP.sol(DT)$count_skills, dplyr.sol(DT))

## [1] TRUE

identical(OP.sol(DT)$count_skills, rowSums.sol(DT))

## [1] TRUE

identical(OP.sol(DT)$count_skills, SDCols.sol(DT))

## [1] TRUE

identical(OP.sol(DT)$count_skills, eval.sol(DT))

## [1] TRUE

DT<-buildDT(2500)
nrow(DT)

## [1] 10000

microbenchmark( # OP.sol(DT), forget this method.
                dplyr.sol(DT),
                rowSums.sol(DT),
                SDCols.sol(DT),
                eval.sol(DT),
                times=100)

## Unit: microseconds
##             expr   min    lq median    uq   max neval
##    dplyr.sol(DT) 760.1 809.0  848.2 951.5  2276   100
##  rowSums.sol(DT) 580.5 605.3  627.6 745.7 28481   100
##   SDCols.sol(DT) 559.8 610.5  638.8 694.0  2016   100
##     eval.sol(DT) 636.4 677.7  692.4 740.5  2021   100

DT<-buildDT(25000)
nrow(DT)

## [1] 100000

microbenchmark( # OP.sol(DT), forget this method.
                dplyr.sol(DT),
                rowSums.sol(DT),
                SDCols.sol(DT),
                eval.sol(DT),
                times=100)

## Unit: milliseconds
##             expr   min    lq median    uq   max neval
##    dplyr.sol(DT) 2.668 3.744  4.045 4.573 33.87   100
##  rowSums.sol(DT) 2.455 3.339  3.756 4.235 34.19   100
##   SDCols.sol(DT) 1.253 1.401  2.179 2.392 31.72   100
##     eval.sol(DT) 1.294 1.427  2.116 2.484 32.02   100

DT<-buildDT(250000)
nrow(DT)

## [1] 1000000

microbenchmark( # OP.sol(DT), forget this method.
                dplyr.sol(DT),
                rowSums.sol(DT),
                SDCols.sol(DT),
                eval.sol(DT),
                times=100)

## Unit: milliseconds
##             expr    min     lq median    uq   max neval
##    dplyr.sol(DT) 21.803 50.260 51.765 52.45 73.30   100
##  rowSums.sol(DT) 20.759 50.224 51.418 52.56 96.28   100
##   SDCols.sol(DT)  7.250  8.916 37.699 38.50 52.69   100
##     eval.sol(DT)  6.883  7.007  7.916  9.45 50.91   100

identical(dplyr.sol(DT), rowSums.sol(DT))

## [1] TRUE

identical(dplyr.sol(DT), SDCols.sol(DT))

## [1] TRUE

identical(dplyr.sol(DT), eval.sol(DT))

## [1] TRUE
Thell
  • Thanks for the benchmarking! The comparison is a bit unfair against `SDCols.sol`: it loses some time adding the variable `count_skills`, which is not done in the other solutions. So the correct version would be `DT[, Reduce(`+`, .SD), .SDcols = grep("skill_", names(DT), value = T)]`. Also, the timing is a bit unstable if you run it only 10 times. I ran the last benchmark with `times=100` and the updated `SDCols.sol`, and got median timings of about 35.2 (`SDCols.sol`) and 32.4 (`eval.sol`). The difference is not so big any more. – djhurio Apr 22 '14 at 19:52
  • @djhurio, updated with evals at 100. Not sure how your medians got so close... My machine specs are 64-bit Core i7 470MQ w/ 24GB mem. – Thell Apr 22 '14 at 20:58
  • The results I got were from a somewhat old machine with an Intel Core 2 CPU. I reran the code on a different machine (i5) and got results close to yours: 44.3 and 9.1. Interesting observation. – djhurio Apr 23 '14 at 05:17
  • I ran the code on a third machine (Intel Xeon E5-4640 2.40GHz) and again the median times for `SDCols.sol` and `eval.sol` are only marginally different. In two of three complete runs `SDCols.sol` was faster. – djhurio Apr 23 '14 at 11:49
  • The OS on the Core 2 and Xeon machines is Linux; the OS on the i5 machine is Windows. What is your OS? It is probably OS dependent, or there is some architectural difference between the i5/i7 processors that makes the results so different. – djhurio Apr 23 '14 at 11:52
  • I'd guess architecture too, but that's hard to say without an idea of Roger's setup... As the comment on the original post and the second line of this post state, size matters. I can't imagine using `eval.sol` unless the dt is _huge_. – Thell Apr 23 '14 at 13:23
  • Old thread, but still the top google hit. I'd add that the two methods that include `rowSums()` easily allow you to include `na.rm = TRUE` if desired, which is a little more difficult with the methods using `+`. – qdread Feb 17 '21 at 11:48
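To illustrate qdread's point, a minimal sketch (DT_na is a made-up example with NAs, not from the answers above):

DT_na <- data.table(skill_a = c(1, NA), skill_b = c(0, 1))
rowSums(DT_na, na.rm = TRUE)  # NAs dropped:    1 1
DT_na[, Reduce(`+`, .SD)]     # NA propagates:  1 NA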

Here is a dplyr solution:

library(dplyr)

DT %>% mutate(count = DT %>% select(starts_with("skill_")) %>% rowSums())
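With newer dplyr (1.1.0 or later) the column selection can also happen inside mutate() via pick(); a sketch of the same idea, assuming a recent dplyr:

DT %>% mutate(count = rowSums(pick(starts_with("skill_"))))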
G. Grothendieck

Why not use rowSums? It is generally efficient:

DT[, rowSums(.SD), .SDcols=patterns("skill_")]
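If you want to store the result as a new column (reusing the count_skills name from the question), a small variation:

DT[, count_skills := rowSums(.SD), .SDcols = patterns("skill_")]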
agstudy
  • Thanks, that solves my problem! I am not sure why the code is not reproducible. Missing a require(data.table)? I'll add that. – Rodrigo Apr 22 '14 at 14:25
  • @RogerBill sorry, the code was reproducible (I had missed a line from your code). – agstudy Apr 22 '14 at 14:26

A solution using data.table and .SDcols:

require(data.table)

DT <- data.table(x=1:4, skill_a=c(0,1,0,0), skill_b=c(0,1,1,0),
                 skill_c=c(0,1,1,1))

DT[, row_idx := 1:nrow(DT)]

DT[, count_skills := Reduce(`+`, .SD), .SDcols = patterns("skill_")]
DT
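For the sample data this should print the same counts as in the question:

##    x skill_a skill_b skill_c row_idx count_skills
## 1: 1       0       0       0       1            0
## 2: 2       1       1       1       2            3
## 3: 3       0       1       1       3            2
## 4: 4       0       0       1       4            1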
djhurio