I recently ran into the data.table package, and I'm still not sure how to do row-wise matrix operations with it. Was it originally intended to handle such operations? For example, what would be the data.table equivalent of apply(M,1,fun)?

fun should take a vector as its argument, for example mean, median, or mad.

Henrik
danas.zuokas

1 Answer

I think you are looking for the := operator (see ?':='). Below is a short example and a timing comparison with mapply. (I hope I am using mapply correctly; I only use data.tables nowadays, so no promises there. Still, the data.table way is fast and, in my opinion, easy to remember.)

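> library(data.table)
> # foo is not defined in the original post; assumed here to be a
> # vectorized function of two arguments, for illustration only
> foo <- function(b, c) b + c
> df <- data.frame(ID = 1:1e6,
+                  B  = rnorm(1e6),
+                  C  = rnorm(1e6))
> # mapply calls foo once per row
> system.time(x <- mapply(foo, df$B, df$C))
   user  system elapsed 
   4.32    0.04    4.38 
> DT <- as.data.table(df)
> # := adds column D by reference with a single vectorized call
> system.time(DT[, D := foo(B, C)])
   user  system elapsed 
   0.02    0.00    0.02 
> all.equal(x, DT[, D])
[1] TRUE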

After posting my answer, I'm no longer sure this is what you are looking for. I hope it is; if not, please give more details (for instance, do you have many columns you want to apply a function to, not just the two in my example?). In any case, this SO post could be of interest to you.
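To make the limits of the := approach concrete, here is a minimal sketch that contrasts a vectorized function with a reducing one such as mean. The column names `S`, `M_all`, `M_row`, and `M_row2` are made up for illustration, and using `.SD` with `.SDcols` is just one possible way to pick the columns, not necessarily the recommended one.

```r
library(data.table)

set.seed(1)
DT <- data.table(ID = 1:5, B = rnorm(5), C = rnorm(5))

# A vectorized function works element-wise, so := already acts per row.
DT[, S := B + C]

# A reducing function such as mean() collapses everything to one number,
# which := then recycles to every row (not a row-wise result).
DT[, M_all := mean(c(B, C))]

# One way to get a per-row result: turn the chosen columns into a matrix
# and use apply() over rows, exactly as with a plain matrix.
DT[, M_row := apply(as.matrix(.SD), 1, mean), .SDcols = c("B", "C")]

# For sums and means, the dedicated row functions give the same values.
DT[, M_row2 := rowMeans(.SD), .SDcols = c("B", "C")]

print(DT)
```

The `apply` variant accepts any function that takes a vector (median, mad, and so on), at the cost of one function call per row.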

Community
Christoph_J
  • `tmp <- DT[, D := sum(B, C)]; tmp[1:2, ]` gives the total sum over all elements. Does not work with `mean`. – danas.zuokas Mar 05 '12 at 12:24
  • http://stackoverflow.com/questions/7885147/efficient-row-wise-operations-on-a-data-table does not generalize to any function (mean). – danas.zuokas Mar 05 '12 at 12:31
  • @danas.zuokas: Good points. In those cases I guess you would just use the `row...` functions, e.g. `rowSums(DT[, list(B, C)])`, but I think it's best to leave the question open. – Christoph_J Mar 05 '12 at 12:46
  • @Danas Or just `apply(DT,1,mad)` as you do normally. That should work. We try and avoid row ops though and prefer column ops, for speed. For example, if along the columns are group names, we tend to put those groups as a key column instead (i.e. _flat_ a.k.a. _tall_ format) and do `DT[,mad(variable),by=group]`. When you need _wide_ format for regressions, you can _reshape_ then (or the data is already wide from some preliminary aggregation step). Like a database. – Matt Dowle Mar 05 '12 at 13:31
  • Thank you all for your tips. It seems that for very basic operations like row means, `rowMeans` is most appropriate for tall matrices. For example, I use the `rowMeans((data-centerM)^2)` style to compute variance. – danas.zuokas Mar 05 '12 at 14:47
  • Yes, and the "bare-bones .colSums(),.rowSums(),.colMeans() and .rowMeans() for use in programming where ultimate speed is required" added in R 2.15.0 – Matt Dowle Mar 05 '12 at 15:35
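The two ideas from the comments above can be sketched roughly as follows: reshaping to tall format and grouping with `by`, and the `rowMeans` idiom for a per-row variance. The table `wide`, its columns, and the names `long`, `res`, `ref`, `center`, and `v` are all hypothetical; treating the row ID as the grouping column is an assumption made here to mimic a row-wise operation.

```r
library(data.table)

set.seed(42)
wide <- data.table(ID = 1:4, B = rnorm(4), C = rnorm(4), D = rnorm(4))

# Tall format: one row per (ID, column) pair, then a column op by group.
long <- melt(wide, id.vars = "ID",
             variable.name = "col", value.name = "value")
res <- long[, .(mad = mad(value)), keyby = ID]

# The same numbers via the usual row-wise apply() on a matrix.
M   <- as.matrix(wide[, .(B, C, D)])
ref <- apply(M, 1, mad)

# rowMeans idiom for variance: mean squared deviation from the row mean.
# Note that this divides by n, not n - 1, so it differs from var().
center <- rowMeans(M)
v <- rowMeans((M - center)^2)

print(res)
print(v)
```

Subtracting `center` from `M` relies on R recycling the vector down each column, which lines up because `center` has one entry per row.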