5

Im trying to perform a simple sum and mean by rows using data.table, but I am getting unexpected results. I followed the help in section 2 of the FAQ manual for data.table. I found a way that works, but I am not sure why this method in section 2 of the FAQ is not. This method gives me the incorrect result (i.e., it is giving me the value of the first column):

dt[, genesum:=lapply(.SD,sum), by=gene]
head(dt)

      gene      TCGA_04_1348      TCGA_04_1362   genesum  
  1:    A1BG          0.94565          0.70585  0.94565   
  2: A1BG-AS          0.97610          1.15850  0.97610   
  3:    A1CF          0.00000          0.02105  0.00000   
  4:   A2BP1          0.00300          0.04150  0.00300   
  5:   A2LD1          4.57975          5.02820  4.57975  
  6:     A2M         60.37320         36.09715 60.37320 

and this is giving me the desired result

dt[, genesum:=apply(dt[,-1, with=FALSE],1, sum)]
head(dt)

       gene     TCGA_04_1348       TCGA_04_1362 genesum
  1:    A1BG          0.94565          0.70585  1.65150
  2: A1BG-AS          0.97610          1.15850  2.13460
  3:    A1CF          0.00000          0.02105  0.02105
  4:   A2BP1          0.00300          0.04150  0.04450
  5:   A2LD1          4.57975          5.02820  9.60795
  6:     A2M         60.37320         36.09715 96.47035

I have many more columns and rows, this is just a subset. Does this have anything to do with the way I set the key?

tables()
 NAME        NROW    MB COLS                                               KEY                                             
 [1,] dt     20,785  2  gene,TCGA_04_1348_01A,TCGA_04_1362_01A,genesum    gene
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
sahir
  • 336
  • 3
  • 8

2 Answers2

4

A few things:

  1. dt[, genesum:=lapply(.SD,sum), by=gene] and dt[, genesum:=apply(dt[ ,-1],1, sum)] are quite different.

    • dt[, genesum:=lapply(.SD,sum), by=gene] loops over the columns of the .SD data.table and sums them

    • dt[, genesum:=apply(dt[, -1], 1, sum)] is looping over the rows (ie. apply(x, 1, function) applies function to every row in x

  2. I think you can get what you want by calling rowSums, like so:

    dt[, genesum := rowSums(dt[, -1])]
    

Is that what you're after?

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Steve Lianoglou
  • 7,183
  • 2
  • 25
  • 21
  • I see the difference now. Efficiency wise, there's no difference between apply and rowSums? – sahir Feb 08 '14 at 22:19
  • @sahir: I reckon `rowSums` should be faster as I believe it avoids the loop in R, however it will also create a copy of its input to convert it to a matrix if it isn't one already. – Steve Lianoglou Feb 14 '14 at 04:34
  • How can I write functions like rowSums...lets say rowMedian ? – Indranil Gayen Feb 28 '17 at 12:53
  • @IndranilGayen: to avoid the looping in R you're going to have to write it in C/C++. Take a look at the [matrixStats](https://github.com/HenrikBengtsson/matrixStats) package, it provides many such row* and col* operations, and they're made fast by delegating to C. (I've linked to github instead of CRAN so you can more easily browse tis source code) – Steve Lianoglou Mar 01 '17 at 13:19
2

Here is one alternative (based on this SO question):

dt[ ,  genesum := sum(.SD[, -1, with=FALSE]), by = 1:NROW(dt) ]

another alternative:

# OR... you can create a column with row positions and apply your function by row
dt[, rowpos := .I]
dt[ ,  genesum := sum(.SD[, -1, with=FALSE]), by = rowpos]
Community
  • 1
  • 1
rafa.pereira
  • 13,251
  • 6
  • 71
  • 109