44

I need to find the row-wise minimum of many (+60) relatively large data.frame (~ 250,000 x 3) (or I can equivalently work on an xts).

set.seed(1000)
my.df <- sample(1:5, 250000*3, replace=TRUE)
dim(my.df) <- c(250000,3)
my.df <- as.data.frame(my.df)
names(my.df) <- c("A", "B", "C")

The data frame my.df looks like this

> head(my.df)

  A B C
1 2 5 2
2 4 5 5
3 1 5 3
4 4 4 3
5 3 5 5
6 1 5 3

I tried

require(data.table)
my.dt <- as.data.table(my.df)

my.dt[, row.min:=0]  # without this: "Attempt to add new column(s) and set subset of rows at the same time"
system.time(
  for (i in 1:dim(my.dt)[1]) my.dt[i, row.min:= min(A, B, C)]
)

On my system this takes ~400 seconds. It works, but I am not confident it is the best way to use data.table. Am I using data.table correctly? Is there a more efficient way to do simple row-wise opertations?

Henrik
  • 65,555
  • 14
  • 143
  • 159
Ryogi
  • 5,497
  • 5
  • 26
  • 46

3 Answers3

56

Or, just pmin.

my.dt <- as.data.table(my.df)
system.time(my.dt[,row.min:=pmin(A,B,C)])
# user  system elapsed 
# 0.02    0.00    0.01 
head(my.dt)
#      A B C row.min
# [1,] 2 5 2       2
# [2,] 4 5 5       4
# [3,] 1 5 3       1
# [4,] 4 4 3       3
# [5,] 3 5 5       3
# [6,] 1 5 3       1
andschar
  • 3,504
  • 2
  • 27
  • 35
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • 3
    Sorry if this is hijacking, but if I am dynamically creating the list of columns, e.g. any column that is a POSIXct type, or whose name matches an expression, giving me a variable such as: x=c("A","B","C") ... how do I use "x" in the pmin expression? It is using the literal values of x rather than scoping them out to the values in the columns. – gkaupas Nov 06 '12 at 22:54
  • 13
    @gkaupas That would make a good question. Wild untested stab: `DT[,do.call(pmin,.SD),.SDcols=x]` or perhaps `do.call(pmin,lapply(x,get))`. The first way using `.SDcols` should be more efficient. You don't have to group to use `.SD`. You can add a `by` clause as well if you need to, without changing `j`. – Matt Dowle Nov 07 '12 at 00:08
  • 3
    Thank you Matthew. The first worked, and I needed to add na.rm=TRUE to it, which took a little trial-and-error, but I ended up with `DT <- DT[,NEWCOLNAME:=do.call(pmin, c(.SD, na.rm=TRUE)), .SDcols=x]`. I was wondering where `.SDcols` was documented and found it in the [NEWS](https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable) for release 1.6.3, but not in any of the help() topics. – gkaupas Nov 07 '12 at 16:55
  • @gkaupas Great. I just checked `?data.table` and `.SDcols` is documented there (search the help page for "SDcols"). – Matt Dowle Nov 07 '12 at 16:59
  • Apologies, of all the obvious places to look, that was the one I didn't check. I was trying things like ??.SDcols and even just ??SD and not getting any relevant hits. – gkaupas Nov 07 '12 at 17:40
  • @gkaupas No problem at all. I'm not sure what `??` does, but if there is a way to search all a package's documentation (all help files & vignettes) for a particular string, then I'd like to know, too! – Matt Dowle Nov 07 '12 at 20:26
  • `help("??")` directs to `help.search()` which will search documentation "matching a given character string in the (file) name, alias, title, concept or keyword entries"... i.e. not the entire help text. Presumably `.SDcols` isn't in the concept or keywords, so it doesn't show up. The rules for searching a vignette are: "The 'name' and 'alias' are both the base of the vignette filename, and the 'concept' entries are taken from the \VignetteKeywords entries. Vignettes are not classified using the help system "keyword" classifications." – gkaupas Nov 13 '12 at 00:50
27

After some discussion around row-wise first/last occurrences from column series in data.table, which suggested that melting first would be faster than a row-wise calculation, I decided to benchmark:

  • pmin (Matt Dowle's answer above), below as tm1
  • apply (Andrie's answer above), below as tm2
  • melting first, then min by group, below as tm3

so:

library(microbenchmark); library(data.table)
set.seed(1000)
b <- data.table(m=integer(), n=integer(), tm1 = numeric(), tm2 = numeric(), tm3 = numeric())

for (m in c(2.5,100)*1e5){

  for (n in c(3,50)){
    my.df <- sample(1:5, m*n, replace=TRUE)
    dim(my.df) <- c(m,n)    
    my.df <- as.data.frame(my.df)
    names(my.df) <- c(LETTERS,letters)[1:n]   
    my.dt <- as.data.table(my.df)

    tm1 <- mean(microbenchmark(my.dt[, foo := do.call(pmin, .SD)], times=30L)$time)/1e6
    my.dt <- as.data.table(my.df)
    tm2 <- mean(microbenchmark(apply(my.dt, 1, min), times=30L)$time)/1e6
    my.dt <- as.data.table(my.df)sv
    tm3 <- mean(microbenchmark(
                melt(my.dt[, id:=1:nrow(my.dt)], id.vars='id')[, min(value), by=id], 
                times=30L
               )$time)/1e6
    b <- rbind(b, data.table(m, n, tm1, tm2, tm3) ) 
  }
}

(I ran out of time to try more combinations) gives us:

b
#          m  n        tm1       tm2         tm3
# 1: 2.5e+05  3   16.20598  1000.345    39.36171
# 2: 2.5e+05 50  166.60470  1452.239   588.49519
# 3: 1.0e+07  3  662.60692 31122.386  1668.83134
# 4: 1.0e+07 50 6594.63368 50915.079 17098.96169
c <- melt(b, id.vars=c('m','n'))

library(ggplot2)
ggplot(c, aes(x=m, linetype=as.factor(n), col=variable, y=value)) + geom_line() +
  ylab('Runtime (millisec)') + xlab('# of rows') +  
  guides(linetype=guide_legend(title='Number of columns'))

enter image description here

Although I knew apply (tm2) would scale poorly, I am surprised that pmin (tm1) scales so well if R is not really designed for row-wise operations. I couldn't identify a case where pmin shouldn't be used over melt-min-by-group (tm3).

C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
  • You don't need to create `id` here, you could just do `melt(my.dt, measure.vars = names(my.dt))` – David Arenburg Dec 01 '15 at 12:12
  • @DavidArenburg `melt(my.dt, measure.vars = names(my.dt))` returns a data.table with two columns, `variable` (A,B,C) and `value`. Without `id` I don't know how to group by the original row number (of the unmelted data.table). – C8H10N4O2 Dec 01 '15 at 13:20
  • 1
    3 years later, this is still the best answer to this question in general. I wish I could upvote multiple times. – jzadra Apr 30 '18 at 20:47
24

The classical way of doing row-wise operations in R is to use apply:

apply(my.df, 1, min)
> head(my.df)
  A B C min
1 2 5 4   2
2 4 3 1   1
3 1 1 5   1
4 4 1 5   1
5 3 3 4   3
6 1 1 1   1

On my machine, this operation takes about 0.25 of a second.

Andrie
  • 176,377
  • 47
  • 447
  • 496