
Is there a way to replace values in a data.frame column that are above or below set threshold values with the max/min threshold values determined by the user in a single step?

The data.table::between() function returns TRUE or FALSE but no indication of whether it's above or below...

See below for an MWE. I can get the result in two steps, but was wondering whether there is already a built-in function for replacing values above/below the max/min thresholds with the max/min values.

Thanks.

library(data.table)
library(magrittr)

a <- data.table(colA = LETTERS[seq(1,10)],
                colB = 1:10)

the_max <- 7
the_min <- 3

# creates TRUE/FALSE column...
a[, colC := between(colB, the_min, the_max)]
a
#>     colA colB  colC
#>  1:    A    1 FALSE
#>  2:    B    2 FALSE
#>  3:    C    3  TRUE
#>  4:    D    4  TRUE
#>  5:    E    5  TRUE
#>  6:    F    6  TRUE
#>  7:    G    7  TRUE
#>  8:    H    8 FALSE
#>  9:    I    9 FALSE
#> 10:    J   10 FALSE

# gets the result...
a[, colD := colB] %>% 
  .[colD < the_min, colD := the_min] %>% 
  .[colD > the_max, colD := the_max]
a
#>     colA colB  colC colD
#>  1:    A    1 FALSE    3
#>  2:    B    2 FALSE    3
#>  3:    C    3  TRUE    3
#>  4:    D    4  TRUE    4
#>  5:    E    5  TRUE    5
#>  6:    F    6  TRUE    6
#>  7:    G    7  TRUE    7
#>  8:    H    8 FALSE    7
#>  9:    I    9 FALSE    7
#> 10:    J   10 FALSE    7

Created on 2019-08-12 by the reprex package (v0.2.1)

M--
Prevost

2 Answers


This can be done with pmin/pmax:

a[, colD := pmin(pmax(the_min, colB), the_max)]
a
#    colA colB colD
# 1:    A    1    3
# 2:    B    2    3
# 3:    C    3    3
# 4:    D    4    4
# 5:    E    5    5
# 6:    F    6    6
# 7:    G    7    7
# 8:    H    8    7
# 9:    I    9    7
#10:    J   10    7
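
If this clamping is needed in several places, the pmin/pmax idiom above could be wrapped in a small helper. This is only a sketch: `clamp` is a hypothetical name, not a function from base R or data.table, and the bounds check is an added assumption about what callers want.

```r
# Hypothetical helper (not part of base R or data.table): limit a numeric
# vector to the closed interval [lower, upper] using pmax()/pmin().
clamp <- function(x, lower, upper) {
  stopifnot(lower <= upper)       # assumed sanity check on the bounds
  pmin(pmax(x, lower), upper)     # raise low values first, then cap high ones
}

clamp(1:10, 3, 7)
#> [1] 3 3 3 4 5 6 7 7 7 7

# With the default na.rm = FALSE, pmax()/pmin() leave missing values as NA
clamp(c(1, NA, 9), 3, 7)
#> [1]  3 NA  7
```

Inside a data.table this would be used as `a[, colD := clamp(colB, the_min, the_max)]`.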
akrun

In reference to this thread: Replace all values lower than threshold in R

This may be more efficient, although it uses the same logic as akrun's answer: values below the minimum are replaced first, then values above the maximum.

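# Note the argument order: upper bound (mmax) first, then lower bound (mmin)
pmaxmin <- function(x, mmax, mmin) {
  x[x < mmin] <- mmin  # raise values below the lower bound
  x[x > mmax] <- mmax  # cap values above the upper bound
  x
}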

a[, colD := pmaxmin(colB, the_max, the_min)][]

#     colA colB colD
#  1:    A    1    3
#  2:    B    2    3
#  3:    C    3    3
#  4:    D    4    4
#  5:    E    5    5
#  6:    F    6    6
#  7:    G    7    7
#  8:    H    8    7
#  9:    I    9    7
# 10:    J   10    7
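
Whether the subset-assignment version is actually faster than pmin/pmax depends on the data and the R version, so it is worth measuring. A rough way to check, assuming a hypothetical test vector of one million uniform values (timings are not shown because they vary by machine):

```r
# Same logic as the pmaxmin() above (argument order: x, upper bound, lower bound)
pmaxmin <- function(x, mmax, mmin) {
  x[x < mmin] <- mmin
  x[x > mmax] <- mmax
  x
}

set.seed(1)                          # arbitrary seed, only for reproducibility
x <- runif(1e6, min = 0, max = 10)   # hypothetical test vector

# Time both approaches on the same input
system.time(r1 <- pmin(pmax(3, x), 7))
system.time(r2 <- pmaxmin(x, 7, 3))

# Both approaches should agree on the clamped values
stopifnot(isTRUE(all.equal(r1, r2)))
```

For more stable numbers, a dedicated benchmarking package would be a better fit than a single `system.time()` call.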

p.s. You don't need magrittr to do multiple steps in data.table:

a[, colD := colB][
  colD < the_min, colD := the_min][
    colD > the_max, colD := the_max]

This does the same as your solution with piping.
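
If the rightward drift of chained `[ ]` calls is a concern, one layout that keeps the steps vertically aligned is to start each continuation line with the closing bracket. This is a stylistic sketch only, reusing the example data from the question; R parses it the same way as the chained form above.

```r
library(data.table)

a <- data.table(colA = LETTERS[seq(1, 10)],
                colB = 1:10)

the_max <- 7
the_min <- 3

# Each line after the first closes the previous `[` and opens the next one,
# so every step starts in the same column.
a[, colD := colB
][colD < the_min, colD := the_min
][colD > the_max, colD := the_max]
```

An editor's auto-indent may still reformat these lines, so this only helps where the layout is left as typed.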

M--
  • Thanks. I use `%>%` to keep the lines vertically aligned. Using `[ ]` slowly moves the code to the right (and chaining many lines moves it further...). If you have a solution for that, please share! (Although from what I've searched there is none, and there is no real speed difference between `%>%` and `[ ]`.) – Prevost Aug 14 '19 at 02:26
  • @Prevost you can have alignment of your choice. I am not sure what you mean. – M-- Aug 14 '19 at 03:04
  • In your code snippet chaining the data.table your variable that you’re creating moves further right each chain that is introduced. The third line colD is further right than the second line colD. – Prevost Aug 14 '19 at 04:06
  • @Prevost that's what I wanted. I could simply remove that extra space. In R indentation is not important. – M-- Aug 14 '19 at 15:13
  • The indent is inserted by default in ‘[ ]’ and I prefer not to have it and do not want to manually delete it so I use ‘%>%’. – Prevost Aug 15 '19 at 01:15
  • @Prevost sorry to point this out, but indent is not inserted by default. That statement is simply wrong. R does not care about indentation. For what it's worth, I could add 10 spaces to the first line and have the second start before the first. – M-- Aug 15 '19 at 01:17
  • Maybe it’s an RStudio thing with data.table. I’ll check it out again. – Prevost Aug 15 '19 at 01:20
  • @Prevost personal preference is another thing. I have nothing against it, except that it adversely affects performance. Cheers. – M-- Aug 15 '19 at 01:21
  • I'm assuming you used RStudio to create the `data.table` code snippet of `a[, colD := colB] [`...where the second and third lines are indented. You are correct that R does not care about indentation...but RStudio by default (and I couldn't turn it off!) will indent your `data.table` code that is chained with `[ ]`...just as it appears in your code. If I can eliminate the need for `magrittr` I'm all for it, and perhaps in this MWE it wasn't required. The default RGui does not indent data.table code chained with `[ ]`. :) – Prevost Aug 15 '19 at 15:48
  • @Prevost whatever keeps you happy – M-- Aug 15 '19 at 16:10