
My goal is to sum, for each row, the values of all columns that start with the prefix skill_ in a data.table. I would prefer a solution using data.table, but I am not picky.

My solution up to now:

> require(data.table)
> DT <- data.table(x=1:4, skill_a=c(0,1,0,0), skill_b=c(0,1,1,0), skill_c=c(0,1,1,1))
> DT[, row_idx := 1:nrow(DT)]
> DT[, count_skills := 
          sapply(1:nrow(DT), 
                 function(id) sum(DT[row_idx == id, 
                                     grepl("skill_", names(DT)), with=FALSE]))]

> DT
   x skill_a skill_b skill_c row_idx count_skills
1: 1       0       0       0       1            0
2: 2       1       1       1       2            3
3: 3       0       1       1       3            2
4: 4       0       0       1       4            1

But this becomes very slow when DT is very large. Is there a more efficient way to do this?

Rodrigo

4 Answers


A question about efficiency and performance always deserves benchmarks...

The size of your data matters: the methods grow at different rates, so which approach is fastest changes with the size of the table...

[Figure: relative benchmark timings for sizes between 2^4 and 2^24, with sizes along floor( 2^logb(10^( seq( 4, 24, .5 ) ), 10 ) ).]

Excerpt of the benchmarks at 1 million rows...

## Unit: milliseconds
##             expr    min     lq median    uq   max neval
##    dplyr.sol(DT) 21.803 50.260 51.765 52.45 73.30   100
##  rowSums.sol(DT) 20.759 50.224 51.418 52.56 96.28   100
##   SDCols.sol(DT)  7.250  8.916 37.699 38.50 52.69   100
##     eval.sol(DT)  6.883  7.007  7.916  9.45 50.91   100

eval.sol is an answer that takes advantage of data.table's handling of expressions: it builds the sum expression once (for this data, skill_a+skill_b+skill_c) and evaluates it within the table. Full source below...

library(compiler)
library(data.table)
suppressMessages(library(dplyr))
library(microbenchmark)

# Replicate the OP's 4-row example table `reps` times.
buildDT <- function(reps) {
  data.table(x=seq_len(reps*4),
             skill_a=rep(c(0,1,0,0),reps),
             skill_b=rep(c(0,1,1,0),reps),
             skill_c=rep(c(0,1,1,1),reps))
}

# The question's approach: one data.table subset per row, so it scales badly.
OP.sol <- function(DT) {
  DT[, row_idx := 1:nrow(DT)]
  DT[, count_skills :=
          sapply(1:nrow(DT),
                 function(id) sum(DT[row_idx == id,
                                     grepl("skill_", names(DT)), with=FALSE]))]
}

# Note: %.% was dplyr's original chaining operator; later versions use %>%.
dplyr.sol <- function(DT)
  DT %.% select(starts_with("skill_")) %.% rowSums()

# Add the skill_ columns pairwise via Reduce over .SD.
SDCols.sol <- function(DT)
  DT[, Reduce(`+`, .SD),
     .SDcols = grep("skill_", names(DT), value = T)]

# Base R rowSums over the matching columns.
rowSums.sol <- function(DT)
  rowSums(DT[, grep("skill_", names(DT)), with=FALSE])

# Build the expression skill_a+skill_b+... once, then evaluate it within DT.
eval.sol <- function(DT) {
  cmd <- parse(text=paste(colnames(DT)[grepl("^skill_", colnames(DT))], collapse='+'))
  DT[, eval(cmd)]
}

DT <- buildDT(1)
identical(OP.sol(DT)$count_skills, dplyr.sol(DT))

## [1] TRUE

identical(OP.sol(DT)$count_skills, rowSums.sol(DT))

## [1] TRUE

identical(OP.sol(DT)$count_skills, SDCols.sol(DT))

## [1] TRUE

identical(OP.sol(DT)$count_skills, eval.sol(DT))

## [1] TRUE

DT<-buildDT(2500)
nrow(DT)

## [1] 10000

microbenchmark( # OP.sol(DT), forget this method.
                dplyr.sol(DT),
                rowSums.sol(DT),
                SDCols.sol(DT),
                eval.sol(DT),
                times=100)

## Unit: microseconds
##             expr   min    lq median    uq   max neval
##    dplyr.sol(DT) 760.1 809.0  848.2 951.5  2276   100
##  rowSums.sol(DT) 580.5 605.3  627.6 745.7 28481   100
##   SDCols.sol(DT) 559.8 610.5  638.8 694.0  2016   100
##     eval.sol(DT) 636.4 677.7  692.4 740.5  2021   100

DT<-buildDT(25000)
nrow(DT)

## [1] 100000

microbenchmark( # OP.sol(DT), forget this method.
                dplyr.sol(DT),
                rowSums.sol(DT),
                SDCols.sol(DT),
                eval.sol(DT),
                times=100)

## Unit: milliseconds
##             expr   min    lq median    uq   max neval
##    dplyr.sol(DT) 2.668 3.744  4.045 4.573 33.87   100
##  rowSums.sol(DT) 2.455 3.339  3.756 4.235 34.19   100
##   SDCols.sol(DT) 1.253 1.401  2.179 2.392 31.72   100
##     eval.sol(DT) 1.294 1.427  2.116 2.484 32.02   100

DT<-buildDT(250000)
nrow(DT)

## [1] 1000000

microbenchmark( # OP.sol(DT), forget this method.
                dplyr.sol(DT),
                rowSums.sol(DT),
                SDCols.sol(DT),
                eval.sol(DT),
                times=100)

## Unit: milliseconds
##             expr    min     lq median    uq   max neval
##    dplyr.sol(DT) 21.803 50.260 51.765 52.45 73.30   100
##  rowSums.sol(DT) 20.759 50.224 51.418 52.56 96.28   100
##   SDCols.sol(DT)  7.250  8.916 37.699 38.50 52.69   100
##     eval.sol(DT)  6.883  7.007  7.916  9.45 50.91   100

identical(dplyr.sol(DT), rowSums.sol(DT))

## [1] TRUE

identical(dplyr.sol(DT), SDCols.sol(DT))

## [1] TRUE

identical(dplyr.sol(DT), eval.sol(DT))

## [1] TRUE
Thell
  • Thanks for the benchmarking! The comparison is a bit unfair against `SDCols.sol`: it loses some time adding the variable `count_skills`, which is not done in the other solutions. So the correct version would be `DT[, Reduce(`+`, .SD), .SDcols = grep("skill_", names(DT), value = T)]`. Also, the timing is a bit unstable if you run it only 10 times. I ran the last benchmark with `times=100` and the updated `SDCols.sol`, and got median timings of about 35.2 (`SDCols.sol`) and 32.4 (`eval.sol`). The difference is not so big any more. – djhurio Apr 22 '14 at 19:52
  • @djhurio, updated with evals at 100. Not sure how your medians got so close... My machine specs are 64-bit Core i7 470MQ w/ 24GB mem. – Thell Apr 22 '14 at 20:58
  • The results I got were from a somewhat old machine with an Intel Core 2 CPU. I reran the code on a different machine (i5) and got results close to yours: 44.3 and 9.1. Interesting observation. – djhurio Apr 23 '14 at 05:17
  • I ran the code on a third machine (Intel Xeon E5-4640 2.40GHz) and again the median times for `SDCols.sol` and `eval.sol` are only marginally different. In two of three complete runs `SDCols.sol` was faster. – djhurio Apr 23 '14 at 11:49
  • The OS on the Core 2 and Xeon machines is Linux; the OS on the i5 machine is Windows. What is your OS? It is probably OS dependent, or there is some architectural difference between the i5/i7 processors that makes the results so different. – djhurio Apr 23 '14 at 11:52
  • I'd guess architecture too, but that's hard to say without an idea of Roger's setup... As the comment on the original post and the second line of this post state, size matters. I can't imagine using `eval.sol` unless the dt is _huge_. – Thell Apr 23 '14 at 13:23
  • Old thread, but still the top google hit. I'd add that the two methods that include `rowSums()` easily allow you to include `na.rm = TRUE` if desired, which is a little more difficult with the methods using `+`. – qdread Feb 17 '21 at 11:48
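To illustrate qdread's point, a minimal sketch (DT_na is a made-up example with NAs, not from the answers above):

DT_na <- data.table(skill_a = c(1, NA), skill_b = c(0, 1))
rowSums(DT_na, na.rm = TRUE)  # NAs dropped:    1 1
DT_na[, Reduce(`+`, .SD)]     # NA propagates:  1 NA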

Here is a dplyr solution:

library(dplyr)

DT %>% mutate(count = DT %>% select(starts_with("skill_")) %>% rowSums())
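With newer dplyr (1.1.0 or later) the column selection can also happen inside mutate() via pick(); a sketch of the same idea, assuming a recent dplyr:

DT %>% mutate(count = rowSums(pick(starts_with("skill_"))))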
G. Grothendieck

Why not use rowSums? It is generally efficient:

DT[, rowSums(.SD), .SDcols=patterns("skill_")]
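If you want to store the result as a new column (reusing the count_skills name from the question), a small variation:

DT[, count_skills := rowSums(.SD), .SDcols = patterns("skill_")]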
agstudy
  • Thanks, that solves my problem! I am not sure why the code is not reproducible. Missing a require(data.table)? I'll add that. – Rodrigo Apr 22 '14 at 14:25
  • @RogerBill sorry, the code was reproducible (I had missed a line from your code). – agstudy Apr 22 '14 at 14:26

A solution using data.table and .SDcols:

require(data.table)

DT <- data.table(x=1:4, skill_a=c(0,1,0,0), skill_b=c(0,1,1,0),
                 skill_c=c(0,1,1,1))

DT[, row_idx := 1:nrow(DT)]

DT[, count_skills := Reduce(`+`, .SD), .SDcols = patterns("skill_")]
DT
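For the sample data this should print the same counts as in the question:

##    x skill_a skill_b skill_c row_idx count_skills
## 1: 1       0       0       0       1            0
## 2: 2       1       1       1       2            3
## 3: 3       0       1       1       3            2
## 4: 4       0       0       1       4            1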
djhurio