Restrict column to specific range using data.table in R

Question

I'm using the data.table package in R and want to perform an operation on a column. Specifically I want to enforce that all values are (0, 1).

Let's just work with a simple example here:

data = data.table(x = rnorm(10))

My data is stored as a data.table so I was thinking that I could do something like this:

data[, newx := max(min(x, 1), 0)]

but the aggregate functions (min and max) compute the vector min/max.

Okay so I make a change an add a by=.I statement:

data[, newx := max(min(x, 1), 0), by=.I]

but this doesn't work either!

What is the correct way, using data.table, to accomplish this kind of task?

by can use `by = seq_along(x)` – jangorecki Jan 28 '16 at 22:45 — jangorecki, Jan 28 '16 at 22:45

score 3 · Accepted Answer · edited Jan 28 '16 at 20:39

3

You can create a dummy index and drop it when it is no longer needed, like this:

data[,Idx := .I][, newx := max(min(x, 1), 0), by = "Idx"][, Idx := NULL][]

#              x      newx
# 1:  1.12585452 1.0000000
# 2:  0.82343338 0.8234334
# 3: -1.02227889 0.0000000
# 4:  1.42761362 1.0000000
# 5:  0.77371518 0.7737152
# 6: -0.22261010 0.0000000
# 7: -0.64862015 0.0000000
# 8: -0.45663845 0.0000000
# 9: -0.96332902 0.0000000
# 10: -0.04396755 0.0000000

edited Jan 28 '16 at 20:39

Jaap

81,064
34
182
193

answered Jan 28 '16 at 20:28

nrussell

18,382
4
47
60

1

What is this alchemy!? Seriously thank you! But why does this not work with `by=.I`? – Ellis Valentiner Jan 28 '16 at 20:31
1

Mmm you would probably have to dig into the `data.table` source code to confirm, but I think since, in the presence of a `by` clause (where `by != .I`) it maintains an index sequence for each group, it just doesn't have the ability to be used as the grouping variable itself. From the help file, `.I is an integer vector equal to seq_len(nrow(x)). While grouping, it holds for each item in the group, it's row location in x. This is useful to subset in j; e.g. DT[, .I[which.max(somecol)], by=grp].` – nrussell Jan 28 '16 at 20:37
Don't see the need to create `Idx`; `data[ , newx := max(min(x, 1), 0), by = seq_along(x)]` appears to work just the same (as pointed out by @jangorecki) – MichaelChirico Jan 29 '16 at 15:13

score 2 · Answer 2 · edited Jan 29 '16 at 15:09

2

You can also try simple ifelse i.e.

data[, newX:= ifelse(x >1,1,x)][, newX:= ifelse(x < 0, 0,x)]

edited Jan 29 '16 at 15:09

MichaelChirico

33,841
14
113
198

answered Jan 29 '16 at 12:40

Kumar Manglam

2,780
1
19
28

Only need one `ifelse` here: `ifelse(x %between% c(0, 1), x, x>0)`. Also, [`ifelse` is slow](http://stackoverflow.com/questions/16275149/does-ifelse-really-calculate-both-of-its-vectors-every-time-is-it-slow); better to just use three lines of code: `data[ , newX := x]; data[x < 0, newX := 0]; data[x > 1, newX := 1]` – MichaelChirico Jan 29 '16 at 15:15

score 1 · Answer 3 · answered Jan 29 '16 at 15:04

Simpler and faster would be to just define it piecewise:

set.seed(13084)
data = data.table(x = rnorm(10))
> data[ , newx := (xg1 <- x > 1) + x * (!xg1 & x > 0)][]
             x      newx
 1:  0.7842597 0.7842597
 2: -0.3935582 0.0000000
 3: -2.3379063 0.0000000
 4: -1.7428335 0.0000000
 5:  0.1678035 0.1678035
 6: -0.9558911 0.0000000
 7: -1.5592778 0.0000000
 8:  0.9358569 0.9358569
 9:  0.7778178 0.7778178
10:  1.0937594 1.0000000

Restrict column to specific range using data.table in R

3 Answers3