R - pass fixed columns to lapply function in data.table

Question

I have a data.table with columns p1, p2, ... that contain percentages. I want to compute the quantiles for each column given a reference variable val. Conceptually, this is like:

quantile(val, p1, type = 4, na.rm = T)
quantile(val, p2, type = 4, na.rm = T)
...

My attempt at using data.table is as follows:

fun <- function(x, y) quantile(y, x, type = 4, na.rm = TRUE)
dt[, c('q1', 'q2') := lapply(.SD, fun), .SDcols = c('p1', 'p2'), by = grp]

where grp is some grouping variable.

However, I am having trouble specifying the y variable in a way that keeps it fixed.

I tried the following:

fun <- function(x, y, dt) quantile(dt[, y], x, type = 4, na.rm = T)
dt[, c('q1', 'q2') := lapply(.SD, fun, y, dt), .SDcols = c('p1', 'p2'), by = grp]

But doing it this way does not respect the grouping when the quantiles are computed: the quantile is based on the whole range of the y variable instead of the y values within each group. What is the correct way to do this?

EDIT:

Here is a trivial example of just one variable:

> dt <- data.table(y = 1:10, p1 = rep(seq(0.2, 1, 0.2), 2), g = c(rep('a', 5), rep('b', 5)))
> dt
     y  p1 g
 1:  1 0.2 a
 2:  2 0.4 a
 3:  3 0.6 a
 4:  4 0.8 a
 5:  5 1.0 a
 6:  6 0.2 b
 7:  7 0.4 b
 8:  8 0.6 b
 9:  9 0.8 b
10: 10 1.0 b
> fun <- function(x, dt, y) quantile(dt[, y], x, type = 4, na.rm = T)
> dt[, c('q1') := lapply(.SD, fun, dt, y), .SDcols = c('p1'), by = c('g')]
> dt
     y  p1 g q1
 1:  1 0.2 a  2
 2:  2 0.4 a  4
 3:  3 0.6 a  6
 4:  4 0.8 a  8
 5:  5 1.0 a 10
 6:  6 0.2 b  2
 7:  7 0.4 b  4
 8:  8 0.6 b  6
 9:  9 0.8 b  8
10: 10 1.0 b 10

You can see q1 is computed using the entire range of y.

Can you post a reproducible example, including what `dt` contains? Is the variable `y` really in the same table as the percentages at which you wish to calculate the quantiles? — mnel, Nov 26 '13 at 23:41
`lapply` should be used with a function of one argument. If you need two or more, `mapply` may help. — Frank, Nov 26 '13 at 23:51
@Frank: Could you help to provide an example of how to use `mapply` in the context of `data.table`? In particular, if I specify a function with two arguments, how can I tell `data.table` to loop over one of them while keeping the other one fixed? — , Nov 27 '13 at 00:00
I think standard R recycling works. That is, you can pass a list of length one and another of length n. — Frank, Nov 27 '13 at 00:07
@Frank: would you mind writing out exactly what you had in mind? Sorry but I am new to this data.table thing. — , Nov 27 '13 at 00:16

score 0 · Accepted Answer · edited Oct 28 '19 at 15:24

It seems very strange to me to store the percentages you require in the same data.table as the data whose quantiles you wish to calculate. However, here is an approach that will work:

dt <- data.table(x = 10:1, y = 1:10, p1 = rep(seq(0.2, 1, 0.2), 2), g = c(rep('a', 5), rep('b', 5)))

dt[, c('qx', 'qy') := Map(f = quantile, x = list(x, y), probs = list(p1), type = 4), by = g]
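As far as I can tell, this works for two reasons: inside `j` with `by = g`, the names `x`, `y` and `p1` refer to the rows of the current group only, and `Map` recycles the length-one `list(p1)` so the same probabilities are paired with each data column. Here is a minimal sketch of that, assuming the data.table package is attached. The column `qy_direct` is made up purely for comparison and is not part of the answer.

```r
library(data.table)

dt <- data.table(x = 10:1, y = 1:10,
                 p1 = rep(seq(0.2, 1, 0.2), 2),
                 g = rep(c("a", "b"), each = 5))

# list(p1) has length one, so Map reuses it for both x and y;
# x, y and p1 are the per-group subsets because of by = g
dt[, c("qx", "qy") := Map(quantile, list(x, y), probs = list(p1), type = 4),
   by = g]

# for a single column, a plain call to quantile in j does the same thing
# (qy_direct is a hypothetical column name used only for this check)
dt[, qy_direct := quantile(y, probs = p1, type = 4), by = g]

print(dt)
```

For this data, `qy` and `qy_direct` should agree row by row, and both stay within the range of `y` in each group rather than using all ten values.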

You can also use .SDcols within .SD to select the columns you want:

dt[, c('qx', 'qy') := Map(f = quantile, x = .SD[, .SD, .SDcols = c('x', 'y')],
                         probs = list(p1), type = 4), by = g]

Or use with = FALSE:

dt[, c('qx', 'qy') := Map(f = quantile, x = .SD[, c('x', 'y'), with = FALSE],
                          probs = list(p1), type = 4), by = g]
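For the layout in the original question (several probability columns and one fixed value column), a sketch along the following lines might be closer to what was asked. It assumes that a column referenced by name inside `j` is still visible per group even when it is not listed in `.SDcols`, which is how I understand data.table to behave. The `p2` column and the `q1`/`q2` names are invented for illustration.

```r
library(data.table)

dt <- data.table(y = 1:10,
                 p1 = rep(seq(0.2, 1, 0.2), 2),
                 p2 = rep(seq(1, 0.2, -0.2), 2),  # hypothetical second probability column
                 g = rep(c("a", "b"), each = 5))

p_cols <- c("p1", "p2")

# .SD holds only the probability columns; y is picked up by name
# and is restricted to the current group because of by = g
dt[, c("q1", "q2") := lapply(.SD, function(p) quantile(y, probs = p, type = 4, na.rm = TRUE)),
   by = g, .SDcols = p_cols]

print(dt)
```

The anonymous function takes one argument (the probability column) and closes over `y`, so `lapply` still sees a one-argument function while `y` stays fixed.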

edited Oct 28 '19 at 15:24

MichaelChirico

answered Nov 27 '13 at 00:52

mnel

Thanks. It works great. Do you happen to know how to specify the `x` by string? I have many columns `vars <- c('x1', 'x2', ...., 'x50')` and I would like to somehow put this in like `x = list(vars)` but it didn't work. – Nov 27 '13 at 07:39
I noticed that using `with = FALSE` can be about twice as fast as using `.SDcols` inside `.SD`. Using `.SDcols` is the recommended approach when using `lapply` on `.SD`, but that does not seem to hold when using `.SDcols` inside `.SD`. Do you know why, or is this documented somewhere? – Nov 28 '13 at 03:30
Because using `.SDcols` in the inner `.SD` creates `.SD` twice, and creating `.SD` is slow. – mnel Nov 28 '13 at 03:36
Thanks. This is quite subtle (to me at least). – Nov 28 '13 at 03:41
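On the earlier comment about naming the data columns with a character vector such as `vars`: one possible approach, sketched here under the assumption that `p1` remains accessible in `j` when it is not in `.SDcols`, is to put `.SDcols = vars` on the outer call and map over `.SD` directly. The `vars` vector and the `q`-prefixed output names are illustrative only.

```r
library(data.table)

dt <- data.table(x = 10:1, y = 1:10,
                 p1 = rep(seq(0.2, 1, 0.2), 2),
                 g = rep(c("a", "b"), each = 5))

vars <- c("x", "y")  # stand-in for a longer vector of column names

# .SD contains exactly the columns named in vars, one group at a time;
# list(p1) is recycled across them by Map
dt[, paste0("q", vars) := Map(quantile, .SD, probs = list(p1), type = 4),
   by = g, .SDcols = vars]

print(dt)
```

Since `.SD` is only subset once here, this should also sidestep the cost of the inner `.SD[...]` call discussed above, though I have not benchmarked it.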
