
I'm trying to do a simple thing: divide 40 columns of a data.table by their mean. I cannot provide the actual data (not all columns are numeric, and I have > 8M rows), but here's an example:

library(data.table)

dt <- data.table(matrix(sample(1:100, 4000, TRUE), ncol = 40))
colmeans <- colMeans(dt)

Next I thought I would do:

for (col in names(colmeans)) dt[,col:=dt[,col]/colmeans[col]]   

But this returns an error, since `dt[,col]` requires that the column name not be quoted. Using `as.name(col)` doesn't cut it either. Now,

res <- t(t(dt[,1:40,with=F]/colmeans))

contains the expected result, but I cannot insert it back into the data.table, as

dt[,1:40] <- res

does not work, and neither does `dt[,1:40:=res, with=F]`.

The following works, but I find it quite ugly:

for (i in seq_along(colmeans)) dt[,i:=dt[,i,with=F]/colmeans[i],with=F]

Sure, I could also recreate a new data.table by calling data.table() on res and the other non-numeric columns my data.table has, but isn't there anything more efficient?

zx8754
jeanlain

4 Answers


How about

dt[, (names(dt)) := lapply(.SD, function(x) x/mean(x))]

If you need to specify certain columns, you could use

dt[, 1:40 := lapply(.SD, function(x) x/mean(x)), .SDcols = 1:40]

or

cols <- names(dt)[c(1,5,10)]
dt[, (cols) := lapply(.SD, function(x) x/mean(x)), .SDcols = cols]
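As a quick sanity check (on a tiny made-up table, not the OP's data): after the update, every transformed column should have mean 1.

```r
library(data.table)

# small illustrative table (hypothetical, not the OP's data)
dt <- data.table(a = c(2, 4, 6), b = c(10, 20, 30))

# divide every column by its own mean, by reference
dt[, (names(dt)) := lapply(.SD, function(x) x/mean(x))]

sapply(dt, mean)  # each column now averages 1
```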
talat
  • Ok, but using lapply would be significantly slower than colMeans if there are many columns. – jeanlain Jun 09 '16 at 09:41
  • @jeanlain I don't think so. – Roland Jun 09 '16 at 10:22
  • Very elegant solution! – Ken Benoit Jun 09 '16 at 13:13
  • Ok it seems more elegant than what I did and efficiency should be ok since there aren't that many columns. I need to benchmark against a solution based on `colMeans`. On matrices, calling `apply()` to compute the mean of each row/column is significantly slower than `colMeans()` or `rowMeans()`. – jeanlain Jun 09 '16 at 13:42
  • `colMeans` copies the whole data.table and converts it into a matrix. For data of relatively large size you want to avoid such copies. Also, if you have many columns, you maybe have a "wide format" table and should consider if it shouldn't be "long format". data.table is more efficient with few columns and many rows than with many columns and few rows. – Roland Jun 09 '16 at 16:29
  • Why is it necessary to wrap `cols` in brackets in the last case? – Alex May 22 '18 at 09:41
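On the last comment: without the parentheses, data.table takes the left-hand side literally as a column name; wrapping it in `()` forces evaluation of the variable. A minimal sketch:

```r
library(data.table)

dt <- data.table(V1 = 1:3, V2 = 4:6)
cols <- "V1"

# Unwrapped, `cols` is taken literally: a new column named "cols" appears
dt[, cols := 99L]

# Wrapped in (), `cols` is evaluated, so the columns it names are updated
dt[, (cols) := 0L]

names(dt)  # "V1" "V2" "cols"
```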

We can also use set(). In this case, there should be no noticeable difference from using `[.data.table` along with `:=`, but in scenarios where `[.data.table` has to be called many times, `set()` avoids that overhead and can be noticeably faster.

for (j in names(dt)) {
  set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
}

It can also be done on selected columns, e.g.

nm1 <- names(dt)[1:5]
for (j in nm1) {
  set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
}
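Since the question mentions that not all columns are numeric, the loop can first be limited to numeric columns (the `is.numeric` filter below is my addition, not part of the answer above):

```r
library(data.table)

# hypothetical mixed-type table
dt <- data.table(id = letters[1:4], x = c(2, 4, 6, 8), y = c(1, 1, 3, 3))

# keep only numeric columns, then scale each by reference
num_cols <- names(dt)[sapply(dt, is.numeric)]
for (j in num_cols) {
  set(dt, i = NULL, j = j, value = dt[[j]]/mean(dt[[j]]))
}

dt$id  # character column is untouched
```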

data

set.seed(24)
dt <- as.data.frame(matrix(sample(1:100,4000,TRUE),ncol=40))
setDT(dt)
Arun
akrun

dplyr 0.4.3

To divide all columns by their mean, you could do:

dplyr::mutate_each(dt, funs(. / mean(.)))

Or to specify column positions:

dplyr::mutate_each(dt, funs(. / mean(.)), 5:10)

Or column names:

dplyr::mutate_each_(dt, funs(. / mean(.)), colnames(dt)[5:10])

dplyr 0.4.3.9000

If you only want to divide numeric columns, the devel version of dplyr has `mutate_if`, which operates on columns for which a predicate returns `TRUE`:

dplyr::mutate_if(dt, is.numeric, funs(. / mean(.)))
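Note that `mutate_each()` and `funs()` have since been deprecated; in current dplyr (>= 1.0) the equivalent uses `across()`. This is an update on my part, not from the original answer:

```r
library(dplyr)

# hypothetical mixed-type data frame
df <- data.frame(id = c("a", "b"), x = c(2, 4))

# divide every numeric column by its mean, leaving others untouched
res <- df %>% mutate(across(where(is.numeric), ~ .x / mean(.x)))
res$x  # 2/3, 4/3
```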
Steven Beaupré
  • That is impressive with `dplyr`. – akrun Jun 09 '16 at 10:07
  • You might need to reassign the output of your dplyr call to achieve equal results – talat Jun 09 '16 at 10:22
  • If you're on the devel version, you'd have to load `dtplyr` to run data.table equivalent methods. Else this is running on data.frame equivalent ones. Also, you'll need a `copy()` prior to running data.table functions to get the same output, with an additional benchmarking of `copy()` alone (to subtract later). Else from the 2nd time on, you'd be computing the mean on `double`, not `integer`. – Arun Jun 09 '16 at 10:39
  • @Arun, thanks for the info, I have wondered whether microbenchmark makes some internal copies of the original data by itself or whether the user has to do so - good to know. This also means that the initial run is on integer values and later runs on numeric/double, right? – talat Jun 09 '16 at 10:43
  • @docendodiscimus exactly. – Arun Jun 09 '16 at 10:44
  • @StevenBeaupré, that would be easier if you provide the complete code including the ggplot creation, so others just need to copy-paste it – talat Jun 09 '16 at 10:51
  • I'm not particularly interested about the benchmark (just because there's nothing interesting there). The function is quite straightforward, and both packages do the same (how many ways are there to compute x/mean(x) :-)?). The replications and run time are also small. And probably at this run time, the integer/double has an effect.. But my intent was just to point out benchmarking `:=`/ref operations. – Arun Jun 09 '16 at 10:55

How about a bit of melt and dcast magic? This converts the data to "long" format and then back to the original "wide" format.

First, melt the variable on an ID:

# make an ID variable
dt[, idvar := 1:nrow(dt)]
# melt the data on the ID variable
dt2 <- melt(dt, "idvar")

Then do the division by mean operation, on each group:

# use data.table by = to do a fast division by group mean
dt2[, divByMean := value / mean(value), by = variable]
dt2
## idvar variable value divByMean
## 1:     1       V1    15 0.2859867
## 2:     2       V1    92 1.7540515
## 3:     3       V1    27 0.5147760
## 4:     4       V1     7 0.1334604
## 5:     5       V1    18 0.3431840
## ---                               
## 3996:    96      V40    54 1.1111111
## 3997:    97      V40    51 1.0493827
## 3998:    98      V40    23 0.4732510
## 3999:    99      V40     8 0.1646091
## 4000:   100      V40    11 0.2263374

Then back to the original wide format:

# now dcast back to "wide"
dt3 <- dcast(dt2, idvar ~ variable, mean, value.var = "divByMean")
dt3[1:5, 1:5]
##   idvar        V1        V2        V3        V4
## 1     1 0.2859867 0.6913303 0.2110919 1.6156624
## 2     2 1.7540515 0.7847534 0.5948954 1.8817715
## 3     3 0.5147760 0.2615845 0.8827480 0.4181715
## 4     4 0.1334604 1.7937220 1.3241220 1.3685611
## 5     5 0.3431840 0.3550075 0.3646133 0.3231325
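On a small made-up table, the whole round trip looks like this; the final clean-up step dropping the helper `idvar` column by reference is my addition:

```r
library(data.table)

# tiny illustrative table (hypothetical, not the OP's data)
dt <- data.table(V1 = c(2, 4), V2 = c(10, 30))
dt[, idvar := seq_len(nrow(dt))]

dt2 <- melt(dt, "idvar")                               # wide -> long
dt2[, divByMean := value / mean(value), by = variable] # scale per column
dt3 <- dcast(dt2, idvar ~ variable, value.var = "divByMean")

dt3[, idvar := NULL]  # drop the helper column
dt3
```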
Ken Benoit