1
library(data.table)
dt = as.data.table(mtcars)

Let's say I would like to replace values of a variable by zero if the value is less than the mean of the variable. And I would like to do that for a set of variables.

This is base-r way:

for (v in c('mpg','cyl',  'disp',  'hp', 'drat')) {
    dt[dt[[v]] < mean(dt[[v]]), (v) := 0]
}

However, I was looking for a native data.table approach if there is. How can I tell data.table to treat v as a name of a variable instead of a variable inside dt?

This was what I was thinking would work:

for (v in c('mpg','cyl',  'disp',  'hp', 'drat')) {
    dt[v < mean(v), (v) := 0]
}
Eyayaw
  • 1,033
  • 5
  • 10
  • 2
    In j you can use a prefix like `..v` but it has not been extended to `i` yet I think https://stackoverflow.com/q/45380628/1191259 You can compose an expression and pass like `dt[eval(i_expr)]` though – Frank Apr 05 '21 at 21:24
  • Exactly, I was looking for an operation in i using what is possible in j with `with=FALSE` `..v` – Eyayaw Apr 06 '21 at 07:50

4 Answers4

3

This is one way to do it in data.table:

library(data.table)

dt <- as.data.table(mtcars)

v <- c('mpg','cyl',  'disp',  'hp', 'drat')

dt[, (v) := lapply(.SD, function(x) ifelse(x < (mean(x, na.rm = TRUE)), 0, x)), .SDcols = v]

#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#>  1: 21.0   0   0.0   0 3.90 2.620 16.46  0  1    4    4
#>  2: 21.0   0   0.0   0 3.90 2.875 17.02  0  1    4    4
#>  3: 22.8   0   0.0   0 3.85 2.320 18.61  1  1    4    1
#>  4: 21.4   0 258.0   0 0.00 3.215 19.44  1  0    3    1
#>  5:  0.0   8 360.0 175 0.00 3.440 17.02  0  0    3    2
#>  6:  0.0   0   0.0   0 0.00 3.460 20.22  1  0    3    1
#>  7:  0.0   8 360.0 245 0.00 3.570 15.84  0  0    3    4
#>  8: 24.4   0   0.0   0 3.69 3.190 20.00  1  0    4    2
#>  9: 22.8   0   0.0   0 3.92 3.150 22.90  1  0    4    2
#> 10:  0.0   0   0.0   0 3.92 3.440 18.30  1  0    4    4
#> 11:  0.0   0   0.0   0 3.92 3.440 18.90  1  0    4    4
#> 12:  0.0   8 275.8 180 0.00 4.070 17.40  0  0    3    3
#> 13:  0.0   8 275.8 180 0.00 3.730 17.60  0  0    3    3
#> 14:  0.0   8 275.8 180 0.00 3.780 18.00  0  0    3    3
#> 15:  0.0   8 472.0 205 0.00 5.250 17.98  0  0    3    4
#> 16:  0.0   8 460.0 215 0.00 5.424 17.82  0  0    3    4
#> 17:  0.0   8 440.0 230 0.00 5.345 17.42  0  0    3    4
#> 18: 32.4   0   0.0   0 4.08 2.200 19.47  1  1    4    1
#> 19: 30.4   0   0.0   0 4.93 1.615 18.52  1  1    4    2
#> 20: 33.9   0   0.0   0 4.22 1.835 19.90  1  1    4    1
#> 21: 21.5   0   0.0   0 3.70 2.465 20.01  1  0    3    1
#> 22:  0.0   8 318.0 150 0.00 3.520 16.87  0  0    3    2
#> 23:  0.0   8 304.0 150 0.00 3.435 17.30  0  0    3    2
#> 24:  0.0   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> 25:  0.0   8 400.0 175 0.00 3.845 17.05  0  0    3    2
#> 26: 27.3   0   0.0   0 4.08 1.935 18.90  1  1    4    1
#> 27: 26.0   0   0.0   0 4.43 2.140 16.70  0  1    5    2
#> 28: 30.4   0   0.0   0 3.77 1.513 16.90  1  1    5    2
#> 29:  0.0   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> 30:  0.0   0   0.0 175 3.62 2.770 15.50  0  1    5    6
#> 31:  0.0   8 301.0 335 0.00 3.570 14.60  0  1    5    8
#> 32: 21.4   0   0.0   0 4.11 2.780 18.60  1  1    4    2
#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb

Created on 2021-04-05 by the reprex package (v0.3.0)

TimTeaFan
  • 17,549
  • 4
  • 18
  • 39
  • Thank you for the answer. I was actually looking for an operation on rows (`i`) explicitly. If there is some way to invoke `with = FALSE` as in the case of an operation on j. – Eyayaw Apr 05 '21 at 13:23
2

Here is an option:

for (x in c('mpg','cyl',  'disp',  'hp', 'drat')) {
   eval(substitute(dt[v < mean(v), v := 0], list(v=as.name(x))))
}
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
2

I guess you are looking for get which allows you to access the variable using a character :

for (v in c('mpg','cyl',  'disp',  'hp', 'drat')) {
    dt[get(v) < mean(get(v)), c(v) := 0]
}

should work

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1: 21.0   0   0.0   0 3.90 2.620 16.46  0  1    4    4
 2: 21.0   0   0.0   0 3.90 2.875 17.02  0  1    4    4
 3: 22.8   0   0.0   0 3.85 2.320 18.61  1  1    4    1
 4: 21.4   0 258.0   0 0.00 3.215 19.44  1  0    3    1
 5:  0.0   8 360.0 175 0.00 3.440 17.02  0  0    3    2
 6:  0.0   0   0.0   0 0.00 3.460 20.22  1  0    3    1
 7:  0.0   8 360.0 245 0.00 3.570 15.84  0  0    3    4
 8: 24.4   0   0.0   0 3.69 3.190 20.00  1  0    4    2
 9: 22.8   0   0.0   0 3.92 3.150 22.90  1  0    4    2
10:  0.0   0   0.0   0 3.92 3.440 18.30  1  0    4    4
11:  0.0   0   0.0   0 3.92 3.440 18.90  1  0    4    4
12:  0.0   8 275.8 180 0.00 4.070 17.40  0  0    3    3
13:  0.0   8 275.8 180 0.00 3.730 17.60  0  0    3    3
14:  0.0   8 275.8 180 0.00 3.780 18.00  0  0    3    3
15:  0.0   8 472.0 205 0.00 5.250 17.98  0  0    3    4
16:  0.0   8 460.0 215 0.00 5.424 17.82  0  0    3    4
17:  0.0   8 440.0 230 0.00 5.345 17.42  0  0    3    4
18: 32.4   0   0.0   0 4.08 2.200 19.47  1  1    4    1
19: 30.4   0   0.0   0 4.93 1.615 18.52  1  1    4    2
20: 33.9   0   0.0   0 4.22 1.835 19.90  1  1    4    1
21: 21.5   0   0.0   0 3.70 2.465 20.01  1  0    3    1
22:  0.0   8 318.0 150 0.00 3.520 16.87  0  0    3    2
23:  0.0   8 304.0 150 0.00 3.435 17.30  0  0    3    2
24:  0.0   8 350.0 245 3.73 3.840 15.41  0  0    3    4
25:  0.0   8 400.0 175 0.00 3.845 17.05  0  0    3    2
26: 27.3   0   0.0   0 4.08 1.935 18.90  1  1    4    1
27: 26.0   0   0.0   0 4.43 2.140 16.70  0  1    5    2
28: 30.4   0   0.0   0 3.77 1.513 16.90  1  1    5    2
29:  0.0   8 351.0 264 4.22 3.170 14.50  0  1    5    4
30:  0.0   0   0.0 175 3.62 2.770 15.50  0  1    5    6
31:  0.0   8 301.0 335 0.00 3.570 14.60  0  1    5    8
32: 21.4   0   0.0   0 4.11 2.780 18.60  1  1    4    2
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb

You will see that it behave differently than with = F:

dt[,"mpg"] # get a data.table of one column, idem dt[,.(mpg)]
dt[,mpg] # get a vector

var <- "mpg"
dt[,var] #error

dt[,get(var)] # same as dt[,mpg]
dt[,var,with = F] # same as dt[,"mpg"]
denis
  • 5,580
  • 1
  • 13
  • 40
0

Another data.table approach is to use the set() function and its i= argument to update by reference on a subset of rows. Here is how to do this:

library(data.table)
dt = as.data.table(mtcars)

for (v in c('mpg','cyl',  'disp',  'hp', 'drat')) {
  my_rows <- which(dt[[v]] < mean(dt[[v]], na.rm = T))
  set(dt, i = my_rows, j = v, value = 0)
}
dt

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1: 21.0   0   0.0   0 3.90 2.620 16.46  0  1    4    4
 2: 21.0   0   0.0   0 3.90 2.875 17.02  0  1    4    4
 3: 22.8   0   0.0   0 3.85 2.320 18.61  1  1    4    1
 4: 21.4   0 258.0   0 0.00 3.215 19.44  1  0    3    1
 5:  0.0   8 360.0 175 0.00 3.440 17.02  0  0    3    2
 6:  0.0   0   0.0   0 0.00 3.460 20.22  1  0    3    1
 7:  0.0   8 360.0 245 0.00 3.570 15.84  0  0    3    4
 8: 24.4   0   0.0   0 3.69 3.190 20.00  1  0    4    2
 9: 22.8   0   0.0   0 3.92 3.150 22.90  1  0    4    2
10:  0.0   0   0.0   0 3.92 3.440 18.30  1  0    4    4
11:  0.0   0   0.0   0 3.92 3.440 18.90  1  0    4    4
12:  0.0   8 275.8 180 0.00 4.070 17.40  0  0    3    3
13:  0.0   8 275.8 180 0.00 3.730 17.60  0  0    3    3
14:  0.0   8 275.8 180 0.00 3.780 18.00  0  0    3    3
15:  0.0   8 472.0 205 0.00 5.250 17.98  0  0    3    4
16:  0.0   8 460.0 215 0.00 5.424 17.82  0  0    3    4
17:  0.0   8 440.0 230 0.00 5.345 17.42  0  0    3    4
18: 32.4   0   0.0   0 4.08 2.200 19.47  1  1    4    1
19: 30.4   0   0.0   0 4.93 1.615 18.52  1  1    4    2
20: 33.9   0   0.0   0 4.22 1.835 19.90  1  1    4    1
21: 21.5   0   0.0   0 3.70 2.465 20.01  1  0    3    1
22:  0.0   8 318.0 150 0.00 3.520 16.87  0  0    3    2
23:  0.0   8 304.0 150 0.00 3.435 17.30  0  0    3    2
24:  0.0   8 350.0 245 3.73 3.840 15.41  0  0    3    4
25:  0.0   8 400.0 175 0.00 3.845 17.05  0  0    3    2
26: 27.3   0   0.0   0 4.08 1.935 18.90  1  1    4    1
27: 26.0   0   0.0   0 4.43 2.140 16.70  0  1    5    2
28: 30.4   0   0.0   0 3.77 1.513 16.90  1  1    5    2
29:  0.0   8 351.0 264 4.22 3.170 14.50  0  1    5    4
30:  0.0   0   0.0 175 3.62 2.770 15.50  0  1    5    6
31:  0.0   8 301.0 335 0.00 3.570 14.60  0  1    5    8
32: 21.4   0   0.0   0 4.11 2.780 18.60  1  1    4    2
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb

Note that this requires which() because the i argument to set() only accepts integers. I suspect that the approach which performs fastest will depend on the size of your data and how complex the subsetting you want to do is. One (minor) advantage of this option is that it avoids the overhead of calling [.data.table()

dlaggy
  • 173
  • 8