
I'm trying to do a simple thing: divide 40 columns of a data.table by their mean. I cannot provide the actual data (not all columns are numeric, and I have > 8M rows), but here's an example:

library(data.table)

dt <- data.table(matrix(sample(1:100, 4000, TRUE), ncol = 40))
colmeans <- colMeans(dt)

Next I thought I would do:

for (col in names(colmeans)) dt[,col:=dt[,col]/colmeans[col]]   

But this returns an error, since `dt[,col]` requires that the column name not be quoted. Using `as.name(col)` doesn't cut it either. Now,

res <- t(t(dt[,1:40,with=F]/colmeans))

contains the expected result, but I cannot insert it back into the data.table, as

dt[,1:40] <- res

does not work, and neither does `dt[,1:40:=res, with=F]`.

The following works, but I find it quite ugly:

for (i in seq_along(colmeans)) dt[,i:=dt[,i,with=F]/colmeans[i],with=F]

Sure, I could also recreate a new data.table by calling data.table() on res and the other non-numeric columns my data.table has, but isn't there anything more efficient?

zx8754
jeanlain

4 Answers


How about

dt[, (names(dt)) := lapply(.SD, function(x) x/mean(x))]

If you need to specify certain columns, you could use

dt[, 1:40 := lapply(.SD, function(x) x/mean(x)), .SDcols = 1:40]

or

cols <- names(dt)[c(1,5,10)]
dt[, (cols) := lapply(.SD, function(x) x/mean(x)), .SDcols = cols]
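As a quick sanity check (on a tiny made-up table, not the OP's data): after the update, every transformed column should have mean 1.

```r
library(data.table)

# small illustrative table (hypothetical, not the OP's data)
dt <- data.table(a = c(2, 4, 6), b = c(10, 20, 30))

# divide every column by its own mean, by reference
dt[, (names(dt)) := lapply(.SD, function(x) x/mean(x))]

sapply(dt, mean)  # each column now averages 1
```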
talat
  • Ok, but using lapply would be significantly slower than colMeans if there are many columns. – jeanlain Jun 09 '16 at 09:41
  • @jeanlain I don't think so. – Roland Jun 09 '16 at 10:22
  • Very elegant solution! – Ken Benoit Jun 09 '16 at 13:13
  • Ok it seems more elegant than what I did and efficiency should be ok since there aren't that many columns. I need to benchmark against a solution based on `colMeans`. On matrices, calling `apply()` to compute the mean of each row/column is significantly slower than `colMeans()` or `rowMeans()`. – jeanlain Jun 09 '16 at 13:42
  • `colMeans` copies the whole data.table and converts it into a matrix. For data of relatively large size you want to avoid such copies. Also, if you have many columns, you maybe have a "wide format" table and should consider if it shouldn't be "long format". data.table is more efficient with few columns and many rows than with many columns and few rows. – Roland Jun 09 '16 at 16:29
  • Why is it necessary to wrap `cols` in brackets in the last case? – Alex May 22 '18 at 09:41
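On the last comment: without the parentheses, data.table takes the left-hand side literally as a column name; wrapping it in `()` forces evaluation of the variable. A minimal sketch:

```r
library(data.table)

dt <- data.table(V1 = 1:3, V2 = 4:6)
cols <- "V1"

# Unwrapped, `cols` is taken literally: a new column named "cols" appears
dt[, cols := 99L]

# Wrapped in (), `cols` is evaluated, so the columns it names are updated
dt[, (cols) := 0L]

names(dt)  # "V1" "V2" "cols"
```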

We can also use set(). In this case, there should be no noticeable difference from using `[.data.table` along with `:=`, but in scenarios where `[.data.table` has to be called many times, `set()` avoids that overhead and can be noticeably faster.

for (j in names(dt)) {
  set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
}

It can also be done on selected columns, e.g.

nm1 <- names(dt)[1:5]
for (j in nm1) {
  set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
}
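Since the question mentions that not all columns are numeric, the loop can first be limited to numeric columns (the `is.numeric` filter below is my addition, not part of the answer above):

```r
library(data.table)

# hypothetical mixed-type table
dt <- data.table(id = letters[1:4], x = c(2, 4, 6, 8), y = c(1, 1, 3, 3))

# keep only numeric columns, then scale each by reference
num_cols <- names(dt)[sapply(dt, is.numeric)]
for (j in num_cols) {
  set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
}

dt$id  # character column is untouched
```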

data

set.seed(24)
dt <- as.data.frame(matrix(sample(1:100,4000,TRUE),ncol=40))
setDT(dt)
Arun
akrun

dplyr 0.4.3

To divide all columns by their mean, you could do:

dplyr::mutate_each(dt, funs(. / mean(.)))

Or to specify column positions:

dplyr::mutate_each(dt, funs(. / mean(.)), 5:10)

Or column names:

dplyr::mutate_each_(dt, funs(. / mean(.)), colnames(dt)[5:10])

dplyr 0.4.3.9000

If you only want to divide numeric columns, the devel version of dplyr has `mutate_if`, which operates on columns for which a predicate returns `TRUE`:

dplyr::mutate_if(dt, is.numeric, funs(. / mean(.)))
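Note that `mutate_each()` and `funs()` have since been deprecated; in current dplyr (>= 1.0) the equivalent uses `across()`. This is an update on my part, not from the original answer:

```r
library(dplyr)

# hypothetical mixed-type data frame
df <- data.frame(id = c("a", "b"), x = c(2, 4))

# divide every numeric column by its mean, leaving others untouched
res <- df %>% mutate(across(where(is.numeric), ~ .x / mean(.x)))
res$x  # 2/3, 4/3
```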
Steven Beaupré
  • That is impressive with `dplyr`. – akrun Jun 09 '16 at 10:07
  • You might need to reassign the output of your dplyr call to achieve equal results – talat Jun 09 '16 at 10:22
  • If you're on the devel version, you'd have to load `dtplyr` to run data.table equivalent methods. Else this is running on data.frame equivalent ones. Also, you'll need a `copy()` prior to running data.table functions to get the same output, with an additional benchmarking of `copy()` alone (to subtract later). Else from the 2nd time on, you'd be computing the mean on `double`, not `integer`. – Arun Jun 09 '16 at 10:39
  • @Arun, thanks for the info, I have wondered whether microbenchmark makes some internal copies of the original data by itself or whether the user has to do so - good to know. This also means that the initial run is on integer values and later runs on numeric/double, right? – talat Jun 09 '16 at 10:43
  • @docendodiscimus exactly. – Arun Jun 09 '16 at 10:44
  • @StevenBeaupré, that would be easier if you provide the complete code including the ggplot creation, so others just need to copy-paste it – talat Jun 09 '16 at 10:51
  • I'm not particularly interested about the benchmark (just because there's nothing interesting there). The function is quite straightforward, and both packages do the same (how many ways are there to compute x/mean(x) :-)?). The replications and run time are also small. And probably at this run time, the integer/double has an effect.. But my intent was just to point out benchmarking `:=`/ref operations. – Arun Jun 09 '16 at 10:55

How about a bit of melt and dcast magic? This converts the data to "long" format and then back to the original "wide" format.

First, melt the variable on an ID:

# make an ID variable
dt[, idvar := 1:nrow(dt)]
# melt the data on the ID variable
dt2 <- melt(dt, "idvar")

Then do the division by mean operation, on each group:

# use data.table by = to do a fast division by group mean
dt2[, divByMean := value / mean(value), by = variable]
dt2
## idvar variable value divByMean
## 1:     1       V1    15 0.2859867
## 2:     2       V1    92 1.7540515
## 3:     3       V1    27 0.5147760
## 4:     4       V1     7 0.1334604
## 5:     5       V1    18 0.3431840
## ---                               
## 3996:    96      V40    54 1.1111111
## 3997:    97      V40    51 1.0493827
## 3998:    98      V40    23 0.4732510
## 3999:    99      V40     8 0.1646091
## 4000:   100      V40    11 0.2263374

Then back to the original wide format:

# now dcast back to "wide"
dt3 <- dcast(dt2, idvar ~ variable, mean, value.var = "divByMean")
dt3[1:5, 1:5]
##   idvar        V1        V2        V3        V4
## 1     1 0.2859867 0.6913303 0.2110919 1.6156624
## 2     2 1.7540515 0.7847534 0.5948954 1.8817715
## 3     3 0.5147760 0.2615845 0.8827480 0.4181715
## 4     4 0.1334604 1.7937220 1.3241220 1.3685611
## 5     5 0.3431840 0.3550075 0.3646133 0.3231325
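On a small made-up table, the whole round trip looks like this; the final clean-up step dropping the helper `idvar` column by reference is my addition:

```r
library(data.table)

# tiny illustrative table (hypothetical, not the OP's data)
dt <- data.table(V1 = c(2, 4), V2 = c(10, 30))
dt[, idvar := seq_len(nrow(dt))]

dt2 <- melt(dt, "idvar")                               # wide -> long
dt2[, divByMean := value / mean(value), by = variable] # scale per column
dt3 <- dcast(dt2, idvar ~ variable, value.var = "divByMean")

dt3[, idvar := NULL]  # drop the helper column
dt3
```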
Ken Benoit