data.table row-wise sum, mean, min, max like dplyr?

Question

There are other posts about row-wise operators on datatable. They are either too simple or solves a specific scenario

My question here is more generic. There is a solution using dplyr. I have played around but failed to find a an equivalent solution using data.table syntax. Can you please suggest an elegant data.table solution that reproduce the same results than the dplyr version?

EDIT 1: Summary of benchmarks of the suggested solutions on real dataset (10MB, 73000 rows, stats made on 24 numeric columns). The benchmark results is subjective. However, the elapsed time is consistently reproducible.

| Solution By | Speed compared to dplyr     |
|-------------|-----------------------------|
| Metrics v1  |  4.3 times SLOWER (use .SD) |
| Metrics v2  |  5.6 times FASTER           |
| ExperimenteR| 15   times FASTER           |
| Arun v1     |  3   times FASTER (Map func)|
| Arun v2     |  3   times FASTER (foo func)|
| Ista        |  4.5 times FASTER           |

EDIT 2: I have added NACount column a day after. This is why this column is not found in the solutions suggested by various contributors.

Data Setup

library(data.table)
dt <- data.table(ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
    Country = c("CA", "FR", "FR", "CA", "CA"),
    Q1 = c(NA, 61, 40, 54, NA), Q2 = c(22,  8, NA,  5, NA),
    Q3 = c(51, NA, NA, 16, NA), Q4 = c(79, 10, 49, NA, NA))

#    ProductName Country Q1 Q2 Q3 Q4
# 1:     Lettuce      CA NA 22 51 79
# 2:    Beetroot      FR 61  8 NA 10
# 3:     Spinach      FR 40 NA NA 49
# 4:        Kale      CA 54  5 16 NA
# 5:      Carrot      CA NA NA NA NA

SOLUTION using dplyr + rowwise()

library(dplyr) ; library(magrittr)
dt %>% rowwise() %>% 
    transmute(ProductName, Country, Q1, Q2, Q3, Q4,
     AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))

#   ProductName Country Q1 Q2 Q3 Q4      AVG MIN  MAX SUM NAcnt
# 1     Lettuce      CA NA 22 51 79 50.66667  22   79 152     1
# 2    Beetroot      FR 61  8 NA 10 26.33333   8   61  79     1
# 3     Spinach      FR 40 NA NA 49 44.50000  40   49  89     2
# 4        Kale      CA 54  5 16 NA 25.00000   5   54  75     1
# 5      Carrot      CA NA NA NA NA      NaN Inf -Inf   0     4

ERROR with data.table (compute entire column instead of per-row)

dt[, .(ProductName, Country, Q1, Q2, Q3, Q4,
    AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))]

#    ProductName Country Q1 Q2 Q3 Q4      AVG MIN MAX SUM NAcnt
# 1:     Lettuce      CA NA 22 51 79 35.90909   5  79 395     9
# 2:    Beetroot      FR 61  8 NA 10 35.90909   5  79 395     9
# 3:     Spinach      FR 40 NA NA 49 35.90909   5  79 395     9
# 4:        Kale      CA 54  5 16 NA 35.90909   5  79 395     9
# 5:      Carrot      CA NA NA NA NA 35.90909   5  79 395     9

ALMOST solution but more complex and missing Q1,Q2,Q3,Q4 output columns

dtmelt <- reshape2::melt(dt, id=c("ProductName", "Country"),
            variable.name="Quarter", value.name="Qty")

dtmelt[, .(AVG = mean(Qty, na.rm=TRUE),
    MIN = min (Qty, na.rm=TRUE),
    MAX = max (Qty, na.rm=TRUE),
    SUM = sum (Qty, na.rm=TRUE),
    NAcnt= sum(is.na(Qty))), by = list(ProductName, Country)]

#    ProductName Country      AVG MIN  MAX SUM NAcnt
# 1:     Lettuce      CA 50.66667  22   79 152     1
# 2:    Beetroot      FR 26.33333   8   61  79     1
# 3:     Spinach      FR 44.50000  40   49  89     2
# 4:        Kale      CA 25.00000   5   54  75     1
# 5:      Carrot      CA      NaN Inf -Inf   0     4

`dt[, AVG := rowMeans(.SD, na.rm=T),.SDcols=c(Q1, Q2,Q3,Q4)]` — ExperimenteR, Jul 07 '15 at 02:05
@ExperimenteR thanks (should SDcols be a char vector?) I tried this `dt[, .(Q1, Q2, Q3, Q4, AVG = rowMeans(.SD, na.rm=T), MIN = pmin(Q1,Q2,Q3,Q4, na.rm=T), MAX = pmax(Q1,Q2,Q3,Q4, na.rm=T) ), .SDcols=c("Q1","Q2","Q3","Q4")]` but still misses SUM and doesn't have ProductName, Country columns — Polymerase, Jul 07 '15 at 02:13
@Metrics there is no output b/c of evaluation error: `dt[, \`:=\` (AVG = rowMeans(.SD, na.rm=TRUE), MIN = min(.SD, na.rm=TRUE), MAX = max(.SD, na.rm=TRUE), SUM = sum(.SD, na.rm=TRUE)), .SDcols = c("Q1","Q2","Q3","Q4"), by=1:nrow(dt)] Warning messages: 1: In min(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) : no non-missing arguments to min; returning Inf 2: In max(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) : no non-missing arguments to max; returning -Inf` — Polymerase, Jul 07 '15 at 02:50
See my answer. I have updated the code and removed from comments. Dplyr and data.table both issue warnings for NaN and -Inf. — Metrics, Jul 07 '15 at 02:52
data.table uses base R functions wherever possible so as to not impose a "walled garden" approach.. However base R doesn't have a nice function that does this operation :-(. So we'll have to implement `colwise()` and `rowwise()` functions as filed under [#1063](https://github.com/Rdatatable/data.table/issues/1063)... I've marked it for next release. — Arun, Jul 07 '15 at 10:52

ExperimenteR · Accepted Answer · 2015-07-07T10:46:10.560

You can use an efficient row-wise functions from matrixStats package.

library(matrixStats)
dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),
          MAX = rowMaxs(as.matrix(.SD), na.rm=T),
          AVG = rowMeans(.SD, na.rm=T),
          SUM = rowSums(.SD, na.rm=T)), .SDcols=c(Q1, Q2,Q3,Q4)]

dt
#    ProductName Country Q1 Q2 Q3 Q4 MIN  MAX      AVG SUM
# 1:     Lettuce      CA NA 22 51 79  22   79 50.66667 152
# 2:    Beetroot      FR 61  8 NA 10   8   61 26.33333  79
# 3:     Spinach      FR 40 NA 79 49  40   79 56.00000 168
# 4:        Kale      CA 54  5 16 NA   5   54 25.00000  75
# 5:      Carrot      CA NA NA NA NA Inf -Inf      NaN   0

For dataset with 500000 rows(using the data.table from CRAN)

dt <- rbindlist(lapply(1:100000, function(i)dt))
system.time(dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),
                      MAX = rowMaxs(as.matrix(.SD), na.rm=T),
                      AVG = rowMeans(.SD, na.rm=T),
                      SUM = rowSums(.SD, na.rm=T)), .SDcols=c("Q1", "Q2","Q3","Q4")])
#  user  system elapsed 
# 0.089   0.004   0.093

rowwise (or by=1:nrow(dt)) is "euphemism" for for loop, as exemplified by

library(dplyr) ; library(magrittr)
system.time(dt %>% rowwise() %>% 
  transmute(ProductName, Country, Q1, Q2, Q3, Q4,
            MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)))
#   user  system elapsed 
# 80.832   0.111  80.974 

system.time(dt[, `:=`(AVG= mean(as.numeric(.SD),na.rm=TRUE),MIN = min(.SD, na.rm=TRUE),MAX = max(.SD, na.rm=TRUE),SUM = sum(.SD, na.rm=TRUE)),.SDcols=c("Q1", "Q2","Q3","Q4"),by=1:nrow(dt)] )
#    user  system elapsed 
# 141.492   0.196 141.757

your solution is the fastest! (see benchmarks in original question) Thanks for introducing the matrixStats package. I would like to know the impact on memory resources of your solution compared to that of Arun's and Metrics' 2nd solution. — Polymerase, Jul 07 '15 at 17:20
@ExperimenteR how's this able to work? `dt <- rbindlist(lapply(1:100000, function(i)dt))` . I tried decomposing it but returned error `dt(list(1))` . Elegant soln though — sahuno, Sep 01 '23 at 21:17
oh i see!!! you duplicated the original data.table multiple times and combined all their rows — sahuno, Sep 01 '23 at 23:05

Metrics · Answer 2 · 2015-07-07T12:31:31.363

19

With by=1:nrow(dt), performs the rowwise operation in data.table

 library(data.table)
dt[, `:=`(AVG= mean(as.numeric(.SD),na.rm=TRUE),MIN = min(.SD, na.rm=TRUE),MAX = max(.SD, na.rm=TRUE),SUM = sum(.SD, na.rm=TRUE)),.SDcols=c(Q1, Q2,Q3,Q4),by=1:nrow(dt)] 
   ProductName Country Q1 Q2 Q3 Q4      AVG MIN  MAX SUM
1:     Lettuce      CA NA 22 51 79 50.66667  22   79 152
2:    Beetroot      FR 61  8 NA 10 26.33333   8   61  79
3:     Spinach      FR 40 NA 79 49 56.00000  40   79 168
4:        Kale      CA 54  5 16 NA 25.00000   5   54  75
5:      Carrot      CA NA NA NA NA      NaN Inf -Inf   0

Warning messages:
1: In min(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In max(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf

You got warning messages, because in row 5, you are computing max, sum, min, and max of nothing. For example, see below:

min(c(NA,NA,NA,NA),na.rm=TRUE)
[1] Inf
Warning message:
In min(c(NA, NA, NA, NA), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf

edited Jul 07 '15 at 12:31

answered Jul 07 '15 at 02:48

Metrics

15,172
7
54
83

Same error, could that be b/c I am using latest data.table 1.9.4 (R version 3.2.0 (2015-04-16))? In addition, I must put SDcols in quotes `.SDcols=c("Q1","Q2","Q3","Q4")` to avoid "object 'Q1' not found". Here is the error when I run your code: `1: In min(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) : no non-missing arguments to min; returning Inf 2: In max(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) : no non-missing arguments to max; returning -Inf` – Polymerase Jul 07 '15 at 02:57
Those are the warnings and not errors (I got it too). You got warnings because your output returns infinite values `-Inf`,`Inf`, and `NaN` (because you are taking the average, sum, min, and max of nothing). If you ran your own dplyr code, it also issues the same warnings. I am using development version 1.9.5+ (you can get it from github). I am not sure why you need to put quotes. It runs without quotes for me. See my updates in the answer. – Metrics Jul 07 '15 at 03:13
Oh that's true. I forgot to print(dt). Sorry! BTW, do you know why I got `object 'Q1' not found` if I don't put quotes around column names in `.SDcols=c(Q1,Q2,Q3,Q4)` (data.table 1.9.4, R v3.2.0) – Polymerase Jul 07 '15 at 04:14
Just applied your solution on a 10MB dataset, 73000 rows. The dplyr version is 4 times faster than the implementation you suggested. Could that be the as.numeric(.SD) in the calculation of AVG? – Polymerase Jul 07 '15 at 04:59
3

You can't benchmark on such small data set it is pretty meaningless. – David Arenburg Jul 07 '15 at 12:28
Yes @David. You are correct. It doesn't make sense. I have omitted it now. – Metrics Jul 07 '15 at 12:30
@Polymerase: I think it has to do with `.SD.`. Try this: where you have to enter all column names: `dt[,`:=`(AVF = mean (c(Q1, Q2, Q3, Q4), na.rm=TRUE),MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)),by=1:nrow(dt)]`. This is faster than your `dplyr` for your small sample data. – Metrics Jul 07 '15 at 12:37
@Metrics the 2nd version you suggested is very fast. Let me test all soltions here and I'll make a summary of all my tests. – Polymerase Jul 07 '15 at 16:35

Arun · Answer 3 · 2015-07-07T19:58:59.157

8

Just another way (not that efficient though, as na.omit() is called each time, and many memory allocations as well):

require(data.table)
new_cols = c("MIN", "MAX", "SUM", "AVG")
dt[, (new_cols) := Map(function(x, f) f(x), 
                       list(na.omit(c(Q1,Q2,Q3,Q4))), 
                       list(min, max, sum, mean)),
   by = 1:nrow(dt)]

#    ProductName Country Q1 Q2 Q3 Q4 MIN  MAX SUM      AVG
# 1:     Lettuce      CA NA 22 51 79  22   79 152 50.66667
# 2:    Beetroot      FR 61  8 NA 10   8   61  79 26.33333
# 3:     Spinach      FR 40 NA 79 49  40   79 168 56.00000
# 4:        Kale      CA 54  5 16 NA   5   54  75 25.00000
# 5:      Carrot      CA NA NA NA NA Inf -Inf   0      NaN

But as I mentioned, this'll get much simpler once colwise() and rowwise() are implemented. The syntax in this case could look something like:

dt[, rowwise(.SD, list(MIN=min, MAX=max, SUM=sum, AVG=mean), na.rm=TRUE), by = 1:nrow(dt)]
# `by = ` is really not necessary in this case.

or even more straightforward for this case:

rowwise(dt, list(...), na.rm=TRUE)

Edit:

Another variation:

myNACount <- function(x, ...) length(attributes(x)$na.action)
foo <- function(x, ...) {
    funs = c(min, max, mean, sum, myNACount)
    lapply(funs, function(f) f(x, ...))
}

dt[, (new_cols) := foo(na.omit(c(Q1, Q2, Q3, Q4)), na.rm=TRUE), by=1:nrow(dt)]
#    ProductName Country Q1 Q2 Q3 Q4 MIN  MAX      SUM AVG NAs
# 1:     Lettuce      CA NA 22 51 79  22   79 50.66667 152   1
# 2:    Beetroot      FR 61  8 NA 10   8   61 26.33333  79   1
# 3:     Spinach      FR 40 NA NA 49  40   49 44.50000  89   2
# 4:        Kale      CA 54  5 16 NA   5   54 25.00000  75   1
# 5:      Carrot      CA NA NA NA NA Inf -Inf      NaN   0   4

edited Jul 07 '15 at 19:58

answered Jul 07 '15 at 12:57

Arun

116,683
26
284
387

Yes, why did you add the `by` in the `rowwise` potential solution? – David Arenburg Jul 07 '15 at 12:59
There might be complex scenarios like `dt[, if (TRUE) do_bla else rowwise(...), by=some_cols]` (like I said, in this case, it isn't necessary). – Arun Jul 07 '15 at 13:02
Hi Arun, this is wonderful, the solution you suggested is 4 times FASTER than the dplyr version (tested on my real 10MB dataset). BTW, I have edited the original question (added the NAcount calculation). I have modified your example by adding MyNACount function. But got NAcnt=0 b/c na.omit() had removed all NA. Can you please suggested a solution? `MyNACount <- function(vectNum) { sum(is.na(vectNum)) } new_cols = c("AVG", "MIN", "MAX", "SUM", "NAcnt") dt[, (new_cols) := Map(function(x, f) f(x), list(na.omit(c(Q1,Q2,Q3,Q4))), list(mean, min, max, sum, MyNACount)), by = 1:nrow(dt)]` – Polymerase Jul 07 '15 at 15:56
@Polymerase, you can define `myNACount` as follows: `myNACount <- function(x) length(attributes(x)$na.action)`. – Arun Jul 07 '15 at 19:52
1

@Arun That `myNACount <- function(x) length(attributes(x)$na.action)` function is outstanding. Thanks. I wish I could understand the mechanism of the optimization. The 2nd variation you suggested is blazingly fast. – Polymerase Jul 07 '15 at 21:19
1

@Arun Ahem ... sorry I made a mistake in the benchmark measure. The 2nd variation you made is slightly faster than the 1st version. The fastest exec time is from ExperimenteR's solution. – Polymerase Jul 07 '15 at 21:43
1

@Polymerase, no worries. I think we all learned quite a bit here :-). Great Q. – Arun Jul 07 '15 at 21:53

score 2 · Answer 4 · edited Apr 22 '20 at 16:05

2

The apply function can be used to perform row-wise calculations. Defining the function separately keeps things cleaner:

dstats <- function(x){
    c(mean(x,na.rm=TRUE),
      min(x, na.rm=TRUE),
      max(x, na.rm=TRUE),
      sum(x, na.rm=TRUE))
}

The function can now be applied over the rows of the data.table.

(dt[,
   c("AVG", "MIN", "MAX", "SUM") := data.frame(t(apply(.SD, 1, dstats))),
   .SDcols=c("Q1", "Q2","Q3","Q4"),
])

Notice that the only advantage of doing this with [.data.table is that it allows the use of := for fast adding by reference.

This is slower but more flexible than the matrixStats solution, and faster than the dplyr solution by @ExperimenteR, clocking in at 36 seconds (my timings for the other methods were similar to those in @ExperimenteR's answer).

edited Apr 22 '20 at 16:05

MichaelChirico

33,841
14
113
198

answered Jul 07 '15 at 14:47

Ista

10,139
2
37
38

1

1. `apply()` converts `.SD` to a matrix = mem alloc. 2. `t()` transposes result = another copy. 3. `data.frame()` = another memory alloc. Not sure of the need for `with = FALSE` here. We can certainly do better by avoiding all these copies. – Arun Jul 07 '15 at 15:18
@Arun Perhaps, but it is fairly quick already, and we can use `matrixStats` if we need more speed. I have `with = FALSE` because `help(":=")` implies that this is needed when the RHS returns a list. – Ista Jul 07 '15 at 15:27
Fairly quick isn't good enough, really, especially when it's trivial to be much more efficient. I've replied to your reply on github project page detailing the reasons. On `with=FALSE`, that's not what it means, but I understand the confusion. Will fix. – Arun Jul 07 '15 at 15:44
@Ista your solution is the 2nd fastest, see benchmark results in original question. – Polymerase Jul 07 '15 at 17:14

score 0 · Answer 5 · answered Apr 26 '20 at 08:27

I hope others when encountering the same problem, they might find helpful.

1st Approach: Combining base R

dt[,`:=`(MIN = apply(dt[, Q1:Q4], 1, FUN = min, na.rm=TRUE),
       MAX = apply(dt[, Q1:Q4], 1, FUN = max, na.rm = TRUE),
       AVG = rowMeans(dt[, Q1:Q4], na.rm = TRUE),
       SUM = rowSums(dt[, Q1:Q4], na.rm = TRUE))][]
# ProductName Country Q1 Q2 Q3 Q4 MIN  MAX      AVG SUM
# 1:     Lettuce      CA NA 22 51 79  22   79 50.66667 152
# 2:    Beetroot      FR 61  8 NA 10   8   61 26.33333  79
# 3:     Spinach      FR 40 NA NA 49  40   49 44.50000  89
# 4:        Kale      CA 54  5 16 NA   5   54 25.00000  75
# 5:      Carrot      CA NA NA NA NA Inf -Inf      NaN   0

2nd Approach: based on @ExperimenteR idea, using matrixStats package

dt1 <- dt[,`:=`(MIN = rowMins(as.matrix(dt[, Q1:Q4]), na.rm=TRUE),
                MAX = rowMaxs(as.matrix(dt[, Q1:Q4]), na.rm = TRUE),
                AVG = rowMeans(dt[, Q1:Q4], na.rm = TRUE),
                SUM = rowSums(dt[, Q1:Q4], na.rm = TRUE))][]
# ProductName Country Q1 Q2 Q3 Q4 MIN  MAX      AVG SUM
# 1:     Lettuce      CA NA 22 51 79  22   79 50.66667 152
# 2:    Beetroot      FR 61  8 NA 10   8   61 26.33333  79
# 3:     Spinach      FR 40 NA NA 49  40   49 44.50000  89
# 4:        Kale      CA 54  5 16 NA   5   54 25.00000  75
# 5:      Carrot      CA NA NA NA NA Inf -Inf      NaN   0

data.table row-wise sum, mean, min, max like dplyr?

5 Answers5

1st Approach: Combining base R

2nd Approach: based on @ExperimenteR idea, using matrixStats package

Linked

Related