3

Although I've figured this out before, I still find myself searching (and unable to find) this syntax on stackoverflow, so...

I want to do row wise operations on a subset of the data.table's columns, using .SD and .SDcols. I can never remember if the operations need an sapply, lapply, or if the belong inside the brackets of .SD.

As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average of the columns starting with q1?

Since overly trivial examples are annoying, I'd also like to calculate a weighted average for columns starting with q2? (weights = 25% 25% and 50% for q2)

library(data.table)

set.seed(10)
dt <- data.table(id = paste0("student_", sprintf("%02.f" , 1:10)),
                 q1_exam1 = round(rnorm(10, .78, .05), 2),
                 q1_exam2 = round(rnorm(10, .68, .02), 2),
                 q1_final = round(rnorm(10, .88, .08), 2),
                 q2_exam1 = round(rnorm(10, .78, .05), 2),
                 q2_exam2 = round(rnorm(10, .68, .10), 2),
                 q2_final = round(rnorm(10, .88, .04), 2))

dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89
Henrik
  • 65,555
  • 14
  • 143
  • 159
geneorama
  • 3,620
  • 4
  • 30
  • 41
  • 1
    If you want to do rowwise operations in `data.table`, you can create an index variable with `.I` and use that in the `by =` part. – Jaap Oct 26 '15 at 18:31
  • 1
    If you're concerned about efficiency, eddi's answer to my question here is quite good: http://stackoverflow.com/a/19279500/1191259 (and, I think, relevant) – Frank Oct 26 '15 at 18:35
  • 1
    I' guessing that you could gain efficiency and clarity either by having your data as "data.frame" with columns like ["id" "quarter" "exam"] and use database operations or by having a "list" (depending on a convenient factor) of "matrices" and use more 'arithmetic' operations. – alexis_laz Oct 26 '15 at 18:53
  • @Jaap Do you mean use `.I` like this: http://stackoverflow.com/questions/16573995/subset-by-group-with-data-table/16574176#16574176 ? (I didn't initially realize this example might be relevant, thanks Frank for showing me this on github) – geneorama Oct 26 '15 at 19:04
  • 1
    Use `by = 1:nrow(dt)`. Or depending on the operation, you might be able to use `Reduce()` in `j`. – Arun Oct 26 '15 at 19:10
  • Sorry, voting to close as "primarily opinion based". If it gets closed before you've found a satisfactory answer, you might consider posting a different question (or editing this one). There is also a package mailing list -- not the github project site itself, but linked from the wiki under "discussion": https://github.com/Rdatatable/data.table/wiki/Getting-started – Frank Oct 26 '15 at 19:34
  • @frank How is this question opinion based? – geneorama Oct 27 '15 at 13:16
  • It's opinion-based, in stackoverflow parlance, because there are many possible answers to "how would you do this differently?" Usually, it would be recommended that you post this sort of thing on http://codereview.stackexchange.com/tour but I mention the mailing list because I have no experience with the code review site. – Frank Oct 27 '15 at 15:04
  • @Frank Ok, I edited the question – geneorama Oct 27 '15 at 16:36
  • Ok, thanks. That works, I think. – Frank Oct 27 '15 at 17:01

2 Answers2

4

Here are a few thoughts on your options, largely gathered from the comments:

apply along rows

The OP's approach uses apply(.,1,.) for the by-row operation, but this is discouraged because it unnecessarily coerces the data.table into a matrix. lapply/sapply also are not suitable, since they are designed to work on each columns separately, not to combine them.

rowMeans and similarly-named functions also coerce to a matrix.

Split by rows

As @Jaap said, you can use by=1:nrow(dt) for any rowwise operation, but it may be slow.

Efficiently create new columns

This approach taken from eddi is probably the most efficient if you must keep your data in wide format:

jwts = list( 
  q1_AVG  = c(q1_exam1 = 1  , q1_exam2 = 1  , q1_final =   1)/3, 
  q2_WAVG = c(q1_exam1 = 1/4, q2_exam2 = 1/4, q2_final = 1/2)
)


for (newj in names(jwts)){
  w = jwts[[newj]]
  dt[, (newj) := Reduce("+", lapply(names(w), function(x) dt[[x]] * w[x]))]
}

This avoids coercion to a matrix and allows for different weighting rules (unlike rowMeans).

Go long

As @alexis_laz suggested, you might gain clarity and efficiency with a different structure, like

# reshape
m = melt(dt, id.vars="id", value.name="score")[,
  c("quarter","exam") := tstrsplit(variable, "_")][, variable := NULL]

# input your weighting rules
w = unique(m[,c("quarter","exam")])
w[quarter=="q1"                , wt := 1/.N]
w[quarter=="q2" & exam=="final", wt := .5]
w[quarter=="q2" & exam!="final", wt := (1-.5)/.N]

# merge and compute
m[w, on=c("quarter","exam")][, sum(score*wt), by=.(id,quarter)]

This is what I would do.


In any case, you should have your weighting rules stored somewhere explicitly rather than entered on the fly if you want to scale up the number of quarters.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    The logic in "Go long" makes a lot of sense (and will work better for my work). At the risk of sounding opinion based, I'd say this seems like the "right" way to think about it. In fact, this sounds like the "right" way to think through the problem. – geneorama Oct 27 '15 at 17:17
  • Also, it's a good example to study to show the value of the new `tstrsplit` and `on` features. – geneorama Oct 27 '15 at 17:19
1

In this case it is possible to use the apply function in base R, but that's not taking advantage of the data.table framework. Also, it doesn't generalize because there are cases which will require more conditional checking.

apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)

# > apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
#  [1] 0.7700000 0.7266667 0.7400000 0.7200000 0.7533333 0.7766667 0.7333333 0.7500000 0.7566667 0.7733333

In this case, again it's possible to put apply into the j argument of the data.table, and use it on the .SD columns:

dt[i = TRUE,
   q1_AVG := round(apply(.SD, 1, mean), 2), 
   .SDcols = grep("^q1", colnames(dt))]
dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86   0.77
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87   0.73
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93   0.74
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97   0.72
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90   0.75
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91   0.78
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84   0.73
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90   0.75
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85   0.76
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89   0.77

The case with the weighted average can be calculated using matrix multiplication;

dt[i = TRUE,
   q2_WAVG := round(as.matrix(.SD) %*% c(.25, .25, .50), 2), 
   .SDcols = grep("^q2", colnames(dt))]
dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG q2_WAVG
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86   0.77    0.80
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87   0.73    0.78
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93   0.74    0.82
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97   0.72    0.87
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90   0.75    0.77
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91   0.78    0.80
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84   0.73    0.80
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90   0.75    0.78
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85   0.76    0.77
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89   0.77    0.82
geneorama
  • 3,620
  • 4
  • 30
  • 41
  • I think if you use `rowMeans` on `matrix` and then the `%*%` on that matrix, would be efficient. In the code you showed, `apply` with MARGIN=1 is used, which is not that fast. Also, by converting `.SD` to `matrix`, it is not using the `data.table` techniques. – akrun Oct 26 '15 at 18:31