2

Hope my title makes sense. I have a dataframe with a column of numeric values, and I would like to use this column to create a new column whereby the numeric values are 'mapped' to different buckets based on their values. Below is some test data, as well as a rough-around-the-edges nested ifelse() approach that I am currently using to solve this problem. I am hoping to code this in a better way that doesn't involve nested ifelse() statements, since this approach doesn't scale well for many buckets:

mydf = data.frame(strings = letters[1:10], 
              numerics = c(0.2, 0.4, 1.3, 5.2, 3.3, 2.1, 7.3, 1.1, 4.3, 8.3),
              stringsAsFactors = FALSE)

Here is my test dataframe, and here is my nested ifelse() approach to solving my problem:

mydf$buckets = ifelse(mydf$numerics <= 2, 0, 
                   ifelse(mydf$numerics <= 4, 1, 
                       ifelse(mydf$numerics <= 5, 2, 
                            ifelse(mydf$numerics <= 7, 3, 4))))

What the above code does is maps values in the numeric column as follows:

  • all values <2 go to 0
  • all values <4 go to 1
  • all values <5 go to 2
  • all values <7 go to 3
  • all values >= 7 to go 4

this approach doesn't scale well for more than a small number of buckets. any help with this is appreciated! Thanks,

Canovice
  • 9,012
  • 22
  • 93
  • 211
  • 3
    `cut(mydf$numerics, breaks = c(0, 2, 4, 5, 7, 10), labels = c(0, 1, 2, 3, 4))` – Ronak Shah Sep 05 '17 at 02:42
  • 1
    I think that you will be interested in `case_when`from the `dplyr` package. It scales up well and the code is clear. – tic-toc-choc Sep 05 '17 at 02:46
  • 1
    `cut` from base R is specifically designed for this case. For non-monotonic mappings, you could create a look-up table and merge. – MichaelChirico Sep 05 '17 at 02:47
  • Although there are different, unique answers here that work, I agree that my question is a duplicate of the other question linked. It appears this is exactly what cut() is for, as Michael mentioned. I like case_when as well though – Canovice Sep 05 '17 at 04:14

2 Answers2

4

try using the findInterval function in base R:

 findInterval(mydf$numerics,c(2,4,5,7))
   [1] 0 0 0 3 1 1 4 0 2 4
Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • this is great, a more exact solution to my question than cut(), although cut() looks to have more use cases – Canovice Sep 05 '17 at 04:14
3

I really like using case_when in this sort of situation as already mentioned by @tictocchoc in the comments:

suppressPackageStartupMessages(library(tidyverse))

mydf = data.frame(strings = letters[1:10], 
                  numerics = c(0.2, 0.4, 1.3, 5.2, 3.3, 2.1, 7.3, 1.1, 4.3, 8.3),
                  stringsAsFactors = FALSE)

mydf %>%
  mutate(buckets = case_when(
    numerics < 2 ~0,
    numerics < 4 ~1,
    numerics < 5 ~2,    
    numerics < 7 ~3,
    numerics >= 7 ~4
  ))
#>    strings numerics buckets
#> 1        a      0.2       0
#> 2        b      0.4       0
#> 3        c      1.3       0
#> 4        d      5.2       3
#> 5        e      3.3       1
#> 6        f      2.1       1
#> 7        g      7.3       4
#> 8        h      1.1       0
#> 9        i      4.3       2
#> 10       j      8.3       4
markdly
  • 4,394
  • 2
  • 19
  • 27
  • yeah i like this solution as well - I think the base R solution is a bit simpler however, but this is a nice solution as well. – Canovice Sep 05 '17 at 04:15