Row operations in data.table

Question

Im trying to perform a simple sum and mean by rows using data.table, but I am getting unexpected results. I followed the help in section 2 of the FAQ manual for data.table. I found a way that works, but I am not sure why this method in section 2 of the FAQ is not. This method gives me the incorrect result (i.e., it is giving me the value of the first column):

dt[, genesum:=lapply(.SD,sum), by=gene]
head(dt)

      gene      TCGA_04_1348      TCGA_04_1362   genesum  
  1:    A1BG          0.94565          0.70585  0.94565   
  2: A1BG-AS          0.97610          1.15850  0.97610   
  3:    A1CF          0.00000          0.02105  0.00000   
  4:   A2BP1          0.00300          0.04150  0.00300   
  5:   A2LD1          4.57975          5.02820  4.57975  
  6:     A2M         60.37320         36.09715 60.37320

and this is giving me the desired result

dt[, genesum:=apply(dt[,-1, with=FALSE],1, sum)]
head(dt)

       gene     TCGA_04_1348       TCGA_04_1362 genesum
  1:    A1BG          0.94565          0.70585  1.65150
  2: A1BG-AS          0.97610          1.15850  2.13460
  3:    A1CF          0.00000          0.02105  0.02105
  4:   A2BP1          0.00300          0.04150  0.04450
  5:   A2LD1          4.57975          5.02820  9.60795
  6:     A2M         60.37320         36.09715 96.47035

I have many more columns and rows, this is just a subset. Does this have anything to do with the way I set the key?

tables()
 NAME        NROW    MB COLS                                               KEY                                             
 [1,] dt     20,785  2  gene,TCGA_04_1348_01A,TCGA_04_1362_01A,genesum    gene

score 4 · Accepted Answer · edited Feb 15 '20 at 04:35

4

A few things:

dt[, genesum:=lapply(.SD,sum), by=gene] and dt[, genesum:=apply(dt[ ,-1],1, sum)] are quite different.
- dt[, genesum:=lapply(.SD,sum), by=gene] loops over the columns of the .SD data.table and sums them
- dt[, genesum:=apply(dt[, -1], 1, sum)] is looping over the rows (ie. apply(x, 1, function) applies function to every row in x
I think you can get what you want by calling rowSums, like so:
```
dt[, genesum := rowSums(dt[, -1])]
```

Is that what you're after?

edited Feb 15 '20 at 04:35

MichaelChirico

33,841
14
113
198

answered Feb 08 '14 at 21:38

Steve Lianoglou

7,183
2
25
21

I see the difference now. Efficiency wise, there's no difference between apply and rowSums? – sahir Feb 08 '14 at 22:19
@sahir: I reckon `rowSums` should be faster as I believe it avoids the loop in R, however it will also create a copy of its input to convert it to a matrix if it isn't one already. – Steve Lianoglou Feb 14 '14 at 04:34
How can I write functions like rowSums...lets say rowMedian ? – Indranil Gayen Feb 28 '17 at 12:53
@IndranilGayen: to avoid the looping in R you're going to have to write it in C/C++. Take a look at the [matrixStats](https://github.com/HenrikBengtsson/matrixStats) package, it provides many such row* and col* operations, and they're made fast by delegating to C. (I've linked to github instead of CRAN so you can more easily browse tis source code) – Steve Lianoglou Mar 01 '17 at 13:19

score 2 · Answer 2 · edited May 23 '17 at 10:29

2

Here is one alternative (based on this SO question):

dt[ ,  genesum := sum(.SD[, -1, with=FALSE]), by = 1:NROW(dt) ]

another alternative:

# OR... you can create a column with row positions and apply your function by row
dt[, rowpos := .I]
dt[ ,  genesum := sum(.SD[, -1, with=FALSE]), by = rowpos]

edited May 23 '17 at 10:29

Community

1
1

answered Jun 06 '16 at 21:29

rafa.pereira

13,251
6
71
109

Row operations in data.table

2 Answers2

Linked