I'm trying to do a row apply in data.table and can't get it to work. How do I do so?

library(data.table)
data(diamonds, package= "ggplot2")
dt <- data.table(diamonds)

# what I want, but via data.table
diamonds$sum1 <- apply(dt[,5:10, with=FALSE], 1, sum)
diamonds$sd1 <- apply(dt[,5:10, with=FALSE], 1, sd)

# why don't these work?
dt[, `:=` (sum1= sum(.SD), 
           sd1= sd(.SD)), .SDcols= 5:10, by= .EACHI]
dt[, `:=` (sum1= sum(dt[,5:10, with=FALSE]),
           sd1= sd(dt[,5:10, with=FALSE])), by= .EACHI]

Both give this error:

Error in `[.data.table`(dt, , `:=`(sum1 = sum(.SD), sd1 = sd(.SD)), .SDcols = 5:10, : object 'f__' not found

related but not the same questions: (1), (2)

alexwhitworth
  • There is an FR somewhere for `by=.EACHI` to do what you want, but at the moment it only has meaning when joining. Use `by = 1:nrow(dt)` instead if you must, but obviously avoid doing that at all costs. – eddi Jul 14 '16 at 20:39
  • Try to avoid by-row operations in general. You could simply convert to long format using `melt` and then operate on a single column. Or you could use `rowMeans`, or check out the `matrixStats` package, which has many other vectorized by-row operations. Or write [your own](http://stackoverflow.com/a/25100036/3001626) vectorized function. Or use Rcpp. As I see it, R is a language in which you always need to invent smart tricks in order to solve your problems efficiently, rather than taking the way that at first sight makes the most sense. – David Arenburg Jul 14 '16 at 20:50
  • @DavidArenburg Fair enough points. But the dev time to write this in Rcpp isn't worth it when I can just use the `data.frame` framework – alexwhitworth Jul 14 '16 at 20:51
  • There are many other solutions that were already written (as I've mentioned). And as I said, you could always try `melt`ing first. It should be very efficient even for huge data sets. – David Arenburg Jul 14 '16 at 20:53
  • To reiterate David's point: both by-row `dt` and `apply` are the wrong way to do this (including in plain `data.frame`) for these particular functions. Interpreted languages will always be slow when looping in the interpreted function space (here that means calling `sum` and `sd` for each element of your loop). You'll be much better off doing the loop in a compiled function instead (like `rowMeans` or the other options David mentioned). – eddi Jul 14 '16 at 21:33
  • Given that the fast solutions seem to be complicated, perhaps we should give a simple, easy-to-read, and correct solution first, even if it is slow on large datasets. Users with small data would appreciate that. A good answer could then also include the efficient solutions if appropriate. – Aaron McDaid Jul 15 '16 at 07:12
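
As a minimal sketch of the simple-but-slow route suggested in the comments (grouping by row number instead of `by = .EACHI`), the following uses a small made-up table in place of `diamonds`. The columns `id`, `a`, `b`, `c` and the vector `cols` are illustrative assumptions, not part of the question. Within each one-row group, `unlist(.SD)` turns the row into a plain numeric vector, which is what `sd()` expects.

```r
library(data.table)

# Hypothetical stand-in data; not the diamonds set from the question
dt <- data.table(id = 1:3,
                 a  = c(1, 2, 3),
                 b  = c(4, 5, 6),
                 c  = c(7L, 8L, 10L))
cols <- c("a", "b", "c")   # columns to combine row-wise (assumed names)

# Grouping by row number makes .SD a one-row data.table in each group.
# unlist() flattens that row to a numeric vector for sum() and sd().
dt[, c("sum1", "sd1") := {
  v <- unlist(.SD)
  list(sum(v), sd(v))
}, by = seq_len(nrow(dt)), .SDcols = cols]

print(dt)
```

This calls `sum` and `sd` once per row from R, so it is only sensible for small tables, as the comments above point out.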

1 Answer

This uses the Rcpp package in order to write a fast function in C++ which is then available to call from R, but I think it's reasonably clear and maintainable.

library(Rcpp)

cppFunction('Rcpp::NumericVector SDrowSums(Rcpp::DataFrame SD) {
    Rcpp::NumericVector sums(SD.nrows()); // preallocate instead of growing with push_back
    for(int i = 0; i<SD.nrows(); ++i) { // each row of .SD
        sums[i] = (0.0
            + as<NumericVector>(SD[4]) [i]
            + as<NumericVector>(SD[5]) [i]
            + as<IntegerVector>(SD[6]) [i] // IntegerVector, or will be slow!
            + as<NumericVector>(SD[7]) [i]
            + as<NumericVector>(SD[8]) [i]
            + as<NumericVector>(SD[9]) [i]
        );
    }
    return sums;
}');

dt[, "sumNew" := SDrowSums(.SD) ,by=seq_len(nrow(dt))]

The final line calls the function SDrowSums once for each row of the table (by=seq_len(nrow(dt))).

For me, that runs in about 0.15 seconds. Not as fast as rowSums (0.01s for me), but the same speed as your original code: diamonds$sum1 <- apply(dt[,5:10, with=FALSE], 1, sum).

You can even replace the (zero-based) column offsets (4,5,6,7,8,9) with the names ("depth", "table", ...). This slows it down a little but is more readable.

In practice here, SDrowSums will be called once for each row, with nrow(.SD)==1, but this cpp function is a bit more general, to allow a multi-row .SD.
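
For comparison, here is a minimal sketch of the vectorized route mentioned in the comments, which also covers the row-wise standard deviation. It uses a small made-up table rather than `diamonds`; the column names, `cols`, and the output names `sum2` and `sd2` are assumptions for illustration. The row SD is computed from the usual sample-variance formula using `rowMeans` and `rowSums`, so no R function is called per row.

```r
library(data.table)

# Hypothetical stand-in data; not the diamonds set from the question
dt <- data.table(id = 1:3,
                 a  = c(1, 2, 3),
                 b  = c(4, 5, 6),
                 c  = c(7L, 8L, 10L))
cols <- c("a", "b", "c")   # columns to combine row-wise (assumed names)

# One numeric matrix holding just the columns of interest
m <- as.matrix(dt[, cols, with = FALSE])
n <- ncol(m)

# Row sums in compiled code
dt[, sum2 := rowSums(m)]

# Row-wise sample SD: sqrt( sum((x - mean)^2) / (n - 1) ).
# m - rowMeans(m) recycles the row means down each column,
# so every element has its own row's mean subtracted.
dt[, sd2 := sqrt(rowSums((m - rowMeans(m))^2) / (n - 1))]

print(dt)
```

The results should agree with `apply(m, 1, sum)` and `apply(m, 1, sd)` up to floating-point error.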

Aaron McDaid
  • Thanks! I think an inner loop would improve your solution (making it more generic): `... for (int j = 0; j < SD.ncol(); j++) {...}` ... Also, this is only a partial solution, since it doesn't include an `SDrowStd` function. – alexwhitworth Jul 16 '16 at 16:01