
I am not very clear about the use of .SD and by.

For instance, does the snippet below mean: 'change all the columns in DT to factor, except A and B'? The data.table manual also says: ".SD refers to the Subset of the data.table for each group (excluding the grouping columns)" - so are columns A and B excluded?

DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]

However, I have also read that by works like GROUP BY in SQL when you aggregate. For instance, if I wanted to take a sum (like a column sum in SQL) over all the columns except A and B, would I still use something similar? Or does the code below instead mean: take the sum and group by the values in columns A and B, as in SQL's GROUP BY?

DT[,lapply(.SD,sum),by=.(A,B)]

Then how do I do a simple colsum over all the columns except A and B?

  • `DT[,colSums(.SD),.SDcols=-c(A, B)]` or `DT[,lapply(.SD, sum),.SDcols=-c(A, B)]` – Khashaa Aug 28 '15 at 17:51
  • Using `by` means that within each `A`x`B` pairing, you sum the value of every _other_ column in `DT`. @Khashaa's comment shows (a few of) the ways to sum over all columns except `A` and `B`, _not by group_ – MichaelChirico Aug 28 '15 at 17:52
  • @MichaelChirico, when I change the column type as in the first example, though, `by` means exclude, I guess, right? And which one is faster, `colSums` or `lapply`? – KTY Aug 28 '15 at 17:56
  • `by` does not mean exclude. It's just that the value of .SD refers only to other columns when `by` is used (which is a strange rule, I think). Regarding `colSums`, don't use it, as mentioned at the bottom here: https://github.com/Rdatatable/data.table/wiki/Do%27s-and-Don%27ts – Frank Aug 28 '15 at 17:59
  • @Frank nice reference on `colSums`, however I'm curious which is faster if you're _not_ doing it by group, but rather over the whole table. – MichaelChirico Aug 28 '15 at 18:04
  • I think the most natural way if you're summing over the whole table (except two cols) is `sapply( DT[, !c("A","B"), with=FALSE], sum)`. You say you tried the three methods... what did you find? – Frank Aug 28 '15 at 18:10
  • @Frank Is that because `sum` is internally optimized? (explained at length by @Arun here http://stackoverflow.com/a/31854111/3573401) – Khashaa Aug 28 '15 at 18:12
  • @Khashaa Oh, maybe I'm wrong then, if data.table has some magic super-performant sum that only works in `j`... in which case Michael's answers below should be better. – Frank Aug 28 '15 at 18:14
  • @Frank So I tried these three methods. IV is my data table with 100M rows and I am taking sum over 26 columns. vector m includes these 26 columns that I am interested in. 1) `colSums(IV[ , 3:28, with=F])` ; 2)`IV[ , colSums(.SD) , .SDcols=m]` ; 3) `IV[,lapply(.SD,sum),.SDcols=m]` . So number 2 and 3 are sort of comparable in terms of speed (however, number 3 with `lapply` is the slowest). Number 1 is very fast though. Do you think it is because in numbers 2 and 3, it does some kind of a for loop? especially `lapply`? – KTY Aug 28 '15 at 18:15
  • I also noticed that when I use `lapply` to convert multiple columns to `factor`, it is very slow compared to updating each column line by line using `:=` . – KTY Aug 28 '15 at 18:17
  • @KTY I'm not sure about why your benchmark turned out that way. Maybe Michael can add that to his benchmark test below. Regarding conversion of multiple columns to factor, you could consider `set` in a loop, mentioned in my comment on the answer below. – Frank Aug 28 '15 at 18:19

1 Answer

Just to illustrate the comments above with an example, let's take

set.seed(10238)
# A and B are the "id" variables within which the
#   "data" variables C and D vary meaningfully
DT = data.table(
  A = rep(1:3, each = 5L), 
  B = rep(1:5, 3L),
  C = sample(15L),
  D = sample(15L)
)
DT
#     A B  C  D
#  1: 1 1 14 11
#  2: 1 2  3  8
#  3: 1 3 15  1
#  4: 1 4  1 14
#  5: 1 5  5  9
#  6: 2 1  7 13
#  7: 2 2  2 12
#  8: 2 3  8  6
#  9: 2 4  9 15
# 10: 2 5  4  3
# 11: 3 1  6  5
# 12: 3 2 12 10
# 13: 3 3 10  4
# 14: 3 4 13  7
# 15: 3 5 11  2

Compare the following:

#Sum all columns
DT[ , lapply(.SD, sum)]
#     A  B   C   D
# 1: 30 45 120 120

#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
#    A  B  C  D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28

#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
#     B   C   D
# 1: 45 120 120

#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
#    B  C  D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14

A few notes:

  • You said "does the below snippet... change all the columns in DT..."

The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.
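
To make this concrete with the small DT built above (tmp is just a throwaway name used for illustration):

tmp = DT[ , lapply(.SD, as.factor), by = .(A, B)]
sapply(DT, class)   # C and D are still integer in DT itself
sapply(tmp, class)  # ... only the returned object tmp holds factors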

  • You mentioned wanting to change the column types

Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One way to do this (the approach you would take with a base-R data.frame, but an incorrect one for data.table) is to overwrite the old data.table with the new one you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].

This is wasteful because it involves creating copies of DT, which can be an efficiency killer when DT is large. The correct data.table approach to this problem is to update the columns by reference using `:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.
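
For the specific case in the question (convert every column except A and B), here is a minimal sketch of the by-reference idiom, together with the set()-in-a-loop alternative suggested in the comments (cols is just an illustrative variable name):

# convert all columns except A and B to factor, by reference -- no copy of DT is made
cols = setdiff(names(DT), c("A", "B"))
DT[ , (cols) := lapply(.SD, as.factor), .SDcols = cols]

# equivalent alternative: update one column at a time with set()
for (col in cols) set(DT, j = col, value = as.factor(DT[[col]]))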

  • You mentioned comparing the efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table (you can see this from the output of adding the verbose = TRUE argument within []; there is a short illustration after the benchmark output below); to see this in action, let's beef up your DT a bit and run a benchmark:

Benchmark and results:

library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]

library(microbenchmark)
microbenchmark(
  times = 100L,
  colsums = colSums(DT[ , !c("A", "B")]),
  lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962   100
#  lapplys  246.5824  250.3753  252.9603  252.1586  254.8297  266.1771   100
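
As a quick way to see the internal optimization mentioned above (a sketch only; the exact verbose wording depends on your data.table version):

# verbose = TRUE reports, among other details, whether and how j was optimized internally
DT[ , lapply(.SD, sum), .SDcols = !c("A", "B"), verbose = TRUE]
# with a grouping, where data.table's GForce optimization of sum applies:
DT[ , lapply(.SD, sum), by = A, .SDcols = !c("A", "B"), verbose = TRUE]
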
  • A `for` loop with `set` is another option for converting a large number of columns, mentioned at the bottom of this answer and suggested/endorsed by Arun and Matt: http://stackoverflow.com/a/16846530/1191259 – Frank Aug 28 '15 at 18:16
  • @Frank yes, and that's what I do now, but it took me quite some time to wrap my head around what was going on there. But it's great to be exposed to that early. – MichaelChirico Aug 28 '15 at 18:18
  • @MichaelChirico, thanks for the tip on changing the column types without creating a new DT! very helpful. – KTY Aug 28 '15 at 18:29