Map numerics to categorical values in R, based on different ranges for the numerics

Question

Hope my title makes sense. I have a dataframe with a column of numeric values, and I would like to use this column to create a new column whereby the numeric values are 'mapped' to different buckets based on their values. Below is some test data, as well as a rough-around-the-edges nested ifelse() approach that I am currently using to solve this problem. I am hoping to code this in a better way that doesn't involve nested ifelse() statements, since this approach doesn't scale well for many buckets:

mydf = data.frame(strings = letters[1:10], 
              numerics = c(0.2, 0.4, 1.3, 5.2, 3.3, 2.1, 7.3, 1.1, 4.3, 8.3),
              stringsAsFactors = FALSE)

Here is my test dataframe, and here is my nested ifelse() approach to solving my problem:

mydf$buckets = ifelse(mydf$numerics <= 2, 0, 
                   ifelse(mydf$numerics <= 4, 1, 
                       ifelse(mydf$numerics <= 5, 2, 
                            ifelse(mydf$numerics <= 7, 3, 4))))

What the above code does is maps values in the numeric column as follows:

all values <2 go to 0
all values <4 go to 1
all values <5 go to 2
all values <7 go to 3
all values >= 7 to go 4

this approach doesn't scale well for more than a small number of buckets. any help with this is appreciated! Thanks,

`cut(mydf$numerics, breaks = c(0, 2, 4, 5, 7, 10), labels = c(0, 1, 2, 3, 4))` — Ronak Shah, Sep 05 '17 at 02:42
I think that you will be interested in `case_when`from the `dplyr` package. It scales up well and the code is clear. — tic-toc-choc, Sep 05 '17 at 02:46
`cut` from base R is specifically designed for this case. For non-monotonic mappings, you could create a look-up table and merge. — MichaelChirico, Sep 05 '17 at 02:47
Although there are different, unique answers here that work, I agree that my question is a duplicate of the other question linked. It appears this is exactly what cut() is for, as Michael mentioned. I like case_when as well though — Canovice, Sep 05 '17 at 04:14

score 4 · Answer 1 · answered Sep 05 '17 at 02:58

4

try using the findInterval function in base R:

 findInterval(mydf$numerics,c(2,4,5,7))
   [1] 0 0 0 3 1 1 4 0 2 4

answered Sep 05 '17 at 02:58

Onyambu

67,392
3
24
53

this is great, a more exact solution to my question than cut(), although cut() looks to have more use cases – Canovice Sep 05 '17 at 04:14

score 3 · Accepted Answer · answered Sep 05 '17 at 02:59

I really like using case_when in this sort of situation as already mentioned by @tictocchoc in the comments:

suppressPackageStartupMessages(library(tidyverse))

mydf = data.frame(strings = letters[1:10], 
                  numerics = c(0.2, 0.4, 1.3, 5.2, 3.3, 2.1, 7.3, 1.1, 4.3, 8.3),
                  stringsAsFactors = FALSE)

mydf %>%
  mutate(buckets = case_when(
    numerics < 2 ~0,
    numerics < 4 ~1,
    numerics < 5 ~2,    
    numerics < 7 ~3,
    numerics >= 7 ~4
  ))
#>    strings numerics buckets
#> 1        a      0.2       0
#> 2        b      0.4       0
#> 3        c      1.3       0
#> 4        d      5.2       3
#> 5        e      3.3       1
#> 6        f      2.1       1
#> 7        g      7.3       4
#> 8        h      1.1       0
#> 9        i      4.3       2
#> 10       j      8.3       4

yeah i like this solution as well - I think the base R solution is a bit simpler however, but this is a nice solution as well. — Canovice, Sep 05 '17 at 04:15

Map numerics to categorical values in R, based on different ranges for the numerics

2 Answers2

Linked