I want to put my x variables into equal-frequency bins in R and then analyse each variable against my response variable within each bin, which will help me choose which variables to include in my logistic model. I've tried rbin_equal_freq from the rbin package, but the output defaults to 20 bins. That helps with choosing variables, but I'd prefer 5 bins. I'm new to R and modelling, so any help would be appreciated, even just a quick and efficient way to convert the tibble into 5 bins.

rbin_equal_freq(df, y, x, bins = 5)
Output:
   lower_cut upper_cut bin_count  good   bad good_rate      woe         iv entropy
       <dbl>     <dbl>     <int> <int> <int>     <dbl>    <dbl>      <dbl>   <dbl>
 1  -12.3      -6.97          33     0    33    0      Inf      Inf        NaN
 2   -6.86     -5.15          33     1    32    0.0303   1.43     0.0602     0.196
 3   -5.12     -4.09          33     2    31    0.0606   0.709    0.0192     0.330
 4   -4.04     -3.18          33     1    32    0.0303   1.43     0.0602     0.196
 5   -3.15     -2.62          33     2    31    0.0606   0.709    0.0192     0.330
 6   -2.55     -1.99          33     3    30    0.0909   0.270    0.00331    0.439
 7   -1.98     -1.32          33     4    29    0.121   -0.0513   0.000135   0.533
 8   -1.30     -0.878         33     4    29    0.121   -0.0513   0.000135   0.533
 9   -0.878    -0.478         33     2    31    0.0606   0.709    0.0192     0.330
10   -0.463    -0.0775        33     3    30    0.0909   0.270    0.00331    0.439
11   -0.0775    0.447         33     1    32    0.0303   1.43     0.0602     0.196
12    0.449     1.05          33     4    29    0.121   -0.0513   0.000135   0.533
13    1.05      1.65          33     3    30    0.0909   0.270    0.00331    0.439
14    1.65      2.32          33     5    28    0.152   -0.310    0.00542    0.614
15    2.32      2.96          33     2    31    0.0606   0.709    0.0192     0.330
16    2.96      3.59          33     5    28    0.152   -0.310    0.00542    0.614
17    3.62      4.73          33     6    27    0.182   -0.528    0.0171     0.684
18    4.75      5.98          33     8    25    0.242   -0.893    0.0555     0.799
19    5.99      8.12          33     8    25    0.242   -0.893    0.0555     0.799
20    8.13     16.4           29    12    17    0.414   -1.68     0.217      0.978

I've also tried various functions posted on here to create equal-frequency bins, but I'm struggling because I'm unfamiliar with the language, so any suggestions would be greatly appreciated.

connni802

1 Answer

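I'm not sure what rbin_equal_freq is doing. It seems odd that it takes two variables rather than one, so it must be doing something more than just binning a single variable (the output above reports good/bad counts, WoE and IV per bin, which suggests it summarises the response within each bin).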

Bins of equal frequency have cut points at quantiles. We can write a quick function that uses quantile to calculate the break points and cut to bin the data:

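bin_equal = function(x, nbin = 5) {
  # Equal-frequency break points: nbin + 1 evenly spaced quantiles from min to max
  breaks = quantile(x, probs = seq(0, 1, length.out = nbin + 1), na.rm = TRUE)
  # include.lowest = TRUE keeps the minimum value in the first bin
  return(cut(x, breaks = breaks, labels = 1:nbin, include.lowest = TRUE))
}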

bin_equal(rnorm(20), nbin = 3)
#  [1] 2 1 2 2 3 3 3 1 1 3 3 3 1 2 1 3 2 2 1 1
# Levels: 1 2 3

Note that this will return a factor.
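
As a quick sanity check (a sketch on made-up data, not taken from the question), tabulating the result shows how many observations land in each bin, and tapply gives the share of positive responses within each bin. The names x, y, bins, bin_counts and bin_rates are hypothetical stand-ins, and the response is simulated purely for illustration.

```r
# Same function as above, repeated so this snippet runs on its own
bin_equal = function(x, nbin = 5) {
  breaks = quantile(x, probs = seq(0, 1, length.out = nbin + 1), na.rm = TRUE)
  cut(x, breaks = breaks, labels = 1:nbin, include.lowest = TRUE)
}

set.seed(1)                                   # hypothetical example data
x = rnorm(100)                                # a continuous predictor
y = rbinom(100, size = 1, prob = plogis(x))   # simulated 0/1 response

bins = bin_equal(x, nbin = 5)
bin_counts = table(bins)            # observations per bin (20 each for 100 untied values)
bin_rates = tapply(y, bins, mean)   # share of y == 1 within each bin

bin_counts
bin_rates
```

If a variable has many tied values, the quantile break points may not be unique, in which case cut stops with an error; that case would need separate handling.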

Gregor Thomas
  • `findInterval(x, breaks, all.inside = TRUE)` instead of `cut` returns a vector of integers matching the output of `bin_equal`. – Rui Barradas May 05 '20 at 04:06
  • Thanks! How would I apply this to all my columns/x variables using the apply family of functions and then analyse them against my response/binary variable? I'm struggling to get my head around the apply family. – connni802 May 05 '20 at 06:20
  • If you want to apply it to all columns in a data frame named `df`, you can do `df[] = lapply(df, bin_equal)`. Or `for (col in seq_along(df)) df[[col]] = bin_equal(df[[col]])` (using `[[` so it also works on a tibble, where `df[, col]` returns a tibble rather than a vector). If you want to understand `*apply` better, I'd strongly recommend [this FAQ](https://stackoverflow.com/q/3505701/903061). – Gregor Thomas May 05 '20 at 06:44
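
To tie the comments together, here is a minimal sketch of binning every predictor while leaving the response alone, then comparing the response rate across bins for each predictor. The data frame and the column names x1, x2 and y are made up for illustration; it assumes the response is a 0/1 column and that the predictors are continuous enough for the quantile breaks to be unique.

```r
# Same function as in the answer, repeated so this snippet runs on its own
bin_equal = function(x, nbin = 5) {
  breaks = quantile(x, probs = seq(0, 1, length.out = nbin + 1), na.rm = TRUE)
  cut(x, breaks = breaks, labels = 1:nbin, include.lowest = TRUE)
}

set.seed(42)
df = data.frame(               # hypothetical data: two predictors and a 0/1 response
  x1 = rnorm(200),
  x2 = runif(200),
  y  = rbinom(200, size = 1, prob = 0.3)
)

xvars = setdiff(names(df), "y")   # every column except the response
binned = df
binned[xvars] = lapply(df[xvars], bin_equal, nbin = 5)

# For each predictor, the share of y == 1 within each of its bins
# (rows are bins 1 to 5, columns are predictors)
rates = sapply(binned[xvars], function(b) tapply(binned$y, b, mean))
rates
```

Excluding the response from the lapply call matters here: `df[] = lapply(df, bin_equal)` would try to bin the binary column as well.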