
This is going to be a long shot, but I'll try anyway. I want to build a centile (100 groups) or decile (10 groups) ranking based on the data frame available. In this example, I have a data frame with 891 records containing the following variables.

  • Unique_ID (numerical): a unique member number
  • xbeta (numerical): the given credit score (which allows ranking to be performed)
  • Good (numerical): binary flag (0 or 1), an indicator of whether the member is delinquent
  • Bad (numerical): binary flag (0 or 1), the inverse of Good

I need your help to build an equivalent of the table below. By changing the number of groups, I'd be able to split it either by 10 or by 100 using xbeta. With the top row being the total (identifiable via _TYPE_), I'd like to produce the following table (see below for more details):

  • r_xbeta is just the row number based on the number of groups
  • _TYPE_ identifies the total (0) or a group rank (1)
  • n = total count
  • GOOD | BAD = count of Good | Bad flags within the rank
  • xbeta stats: min | max | mean | median
  • GB_ODDS = GOOD / BAD for the rank
  • LN_GB_ODDS = log(GB_ODDS)
  • the rest should be self-explanatory

Your help is much appreciated. – Jim, learning R

r_xbeta _TYPE_  n   GOOD    BAD xbeta_min   xbeta_max   xbeta_mean  xbeta_MEDIAN    GB_ODDS LN_GB_ODDS  Cummu_Good  Cummu_Bad   Cummu_Good_pct  Cummu_Bad_pct
.       0       891 342     549 -4.42       3.63        -0.7        -1.09           0.62295 -0.47329    342         549         100%            100%
0       1       89  4       85  -4.42       -2.7        -3.6        -3.57           0.04706 -3.05636    4           85          1.20%           15%
1       1       89  12      77  -2.69       -2.37       -2.55       -2.54           0.15584 -1.8589     16          162         4.70%           30%
2       1       87  12      75  -2.35       -1.95       -2.16       -2.2            0.16    -1.83258    28          237         8.20%           43%
3       1       93  14      79  -1.95       -1.54       -1.75       -1.79           0.17722 -1.73039    42          316         12%             58%
4       1       88  10      78  -1.53       -1.09       -1.33       -1.33           0.12821 -2.05412    52          394         15%             72%
5       1       89  27      62  -1.03       -0.25       -0.67       -0.69           0.43548 -0.8313     79          456         23%             83%
6       1       89  44      45  -0.24       0.33        0.05        0.03            0.97778 -0.02247    123         501         36%             91%
7       1       89  54      35  0.37        1.07        0.66        0.63            1.54286 0.43364     177         536         52%             98%
8       1       88  77      11  1.08        2.15        1.56        1.5             7       1.94591     254         547         74%             100%
9       1       90  88      2   2.18        3.63        2.77        2.76            44      3.78419     342         549         100%            100%

2 Answers


A reproducible example would be great, i.e. something we can copy-paste into a terminal that demonstrates your problem. For example, here is the dataframe I'll work with:

set.seed(1) # so you get the same random numbers as me
my_dataframe <- data.frame(Unique_ID = 1:891,
                           xbeta=rnorm(891, sd=10),
                           Good=round(runif(891) < 0.5),
                           Bad=round(runif(891) < 0.5))
head(my_dataframe)
#   Unique_ID     xbeta Good Bad
# 1         1 -6.264538    1   0
# 2         2  1.836433    1   0
# 3         3 -8.356286    0   1
# 4         4 15.952808    1   1
# 5         5  3.295078    1   0
# 6         6 -8.204684    1   1

(The particular numbers don't matter to your question, which is why I made up random ones.)

The idea is to:

  • work out which quantile each row belongs to: see ?quantile. You can specify which quantiles you want (I've shown deciles)

    quantile(my_dataframe$xbeta, seq(0, 1, by=.1))
    #          0%         10%         20%         30%         40%         50%         60%         70%         80%         90%        100% 
    # -30.0804860 -13.3880074  -8.7326454  -5.1121923  -3.0097613  -0.4493361   2.3680366   5.3732613   8.7867326  13.2425863  38.1027668 
    

    This gives the quantile cutoffs; if you use cut on these you can add a variable that says which quantile each row is in (?cut):

    my_dataframe$quantile <- cut(my_dataframe$xbeta,
                                 quantile(my_dataframe$xbeta, seq(0, 1, by=.1)))
    

    Have a look at head(my_dataframe) to see what this did. The quantile column is a factor.

  • split up your dataframe by quantile and calculate the stats for each. You can use the plyr, dplyr or data.table packages for this; I recommend one of the first two as you are new to R. (If you need to do massive merges and calculations on huge tables efficiently — thousands of rows — use data.table, but the learning curve is much steeper.) I will show you plyr purely because it's the one I find easiest; dplyr is very similar, just with a different syntax.

    # The idea: `ddply(my_dataframe, .(quantile), FUNCTION)` applies FUNCTION
    #  to each subset of `my_dataframe`, where we split it up into unique
    # `quantile`s.
    # For us, `FUNCTION` is `summarize`, which calculates summary stats
    #  on each subset of the dataframe.
    # The arguments after `summarize` are the new summary columns we
    #  wish to calculate.
    
    library(plyr)
    output = ddply(my_dataframe, .(quantile), summarize,
                   n=length(Unique_ID), GOOD=sum(Good), BAD=sum(Bad),
                   xbeta_min=min(xbeta), xbeta_max=max(xbeta),
                   GB_ODDS=GOOD/BAD) # you can calculate the rest yourself,
                                     # "the rest should be self explanatory".
    > head(output, 3)
           quantile  n GOOD BAD  xbeta_min  xbeta_max   GB_ODDS
     1 (-30.1,-13.4] 89   41  39 -29.397737 -13.388007 1.0512821
     2 (-13.4,-8.73] 89   49  45 -13.353714  -8.732645 1.0888889
     3 (-8.73,-5.11] 89   46  48  -8.667335  -5.112192 0.9583333
    
  • Calculate the other columns. See ?cumsum for cumulative sums, e.g. output$cummu_good <- cumsum(output$GOOD).
  • Add the 'total' row. You should be able to do this; you can append an extra row to output using rbind.
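
    That last step can be sketched as follows — a rough illustration, assuming the `output` columns produced by the `ddply` call above (`quantile`, `n`, `GOOD`, `BAD`, `xbeta_min`, `xbeta_max`, `GB_ODDS`):

    ```r
    # Build the 'total' row from the full dataframe, then stack it on top.
    # The quantile is NA for the total row (the question marks it with ".").
    total_row <- data.frame(quantile  = NA,
                            n         = nrow(my_dataframe),
                            GOOD      = sum(my_dataframe$Good),
                            BAD       = sum(my_dataframe$Bad),
                            xbeta_min = min(my_dataframe$xbeta),
                            xbeta_max = max(my_dataframe$xbeta),
                            GB_ODDS   = sum(my_dataframe$Good) / sum(my_dataframe$Bad))
    output <- rbind(total_row, output)
    ```

    rbind needs both data frames to have exactly the same columns, so if you add more summary columns to the ddply call, mirror them in total_row too.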
  • Awesome. Tested everything, and it worked beautifully. But how do I do the Cummu_Good_pct? – Choc_waffles Aug 03 '15 at 12:30
  • You can work it out from what I've given you. Calculate a percentage, then see `?cumsum` to do a cumulative sum. – mathematical.coffee Aug 03 '15 at 12:32
  • Thx math coffee. You make my transition to R easier – Choc_waffles Aug 03 '15 at 21:40
  • How do I deal with the error "'breaks' are not unique" when I cut the quantiles into my_dataframe$quantile? The 'unique' solution in http://stackoverflow.com/questions/16184947/cut-error-breaks-are-not-unique doesn't help. – Choc_waffles Aug 04 '15 at 11:24

Here is the final version of my script, with mathematical.coffee's guidance. I had to use .bincode instead of the suggested cut due to the "'breaks' are not unique" error.

Thanks everyone.
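
To illustrate why cut failed on my data — a constructed toy example, not my actual dataset: when many values are tied, quantile() returns duplicated cutoffs, which cut() rejects but .bincode() accepts:

```r
x <- c(rep(0, 50), 1:10)                 # heavily tied data
brks <- quantile(x, seq(0, 1, by = .1))  # several cutoffs are identical
# cut(x, brks)                           # error: 'breaks' are not unique
grp <- .bincode(x, brks, include.lowest = TRUE)  # tolerates repeated breaks
```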

set.seed(1) # so you get the same random numbers as me
my_dataframe <- data.frame(Unique_ID = 1:891,
                           xbeta=rnorm(891, sd=10),
                           Good=round(runif(891) < 0.5),
                           Bad=round(runif(891) < 0.5))
head(my_dataframe)
quantile(my_dataframe$xbeta, seq(0, 1, by=.1))
my_dataframe$quantile = .bincode(my_dataframe$xbeta,
                                 quantile(my_dataframe$xbeta, seq(0, 1, by = .1)),
                                 include.lowest = TRUE) # without include.lowest, the minimum xbeta falls in no bin (NA)
library(plyr)
output = ddply(my_dataframe, .(quantile), summarize,
               n=length(Unique_ID), GOOD=sum(Good), BAD=sum(Bad),
               xbeta_min=min(xbeta), xbeta_max=max(xbeta), xbeta_median=median(xbeta), xbeta_mean=mean(xbeta),
               GB_ODDS=GOOD/BAD, LN_GB_ODDS = log(GOOD/BAD))
output$cummu_good = cumsum(output$GOOD)
output$cummu_bad = cumsum(output$BAD)
output$cummu_n = cumsum(output$n)
output$sum_good = sum(output$GOOD)
output$sum_bad = sum(output$BAD)
output$cummu_good_pct = cumsum(output$GOOD/output$sum_good)
output$cummu_bad_pct = cumsum(output$BAD/output$sum_bad)
output[["sum_good"]]=NULL
output[["sum_bad"]]=NULL
output
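
For anyone who prefers dplyr (the other package mathematical.coffee mentioned), the same summary can be sketched like this — a rough equivalent I have not verified against the exact output of the script above:

```r
library(dplyr)

output2 <- my_dataframe %>%
  group_by(quantile) %>%
  summarise(n = n(),
            GOOD = sum(Good), BAD = sum(Bad),
            xbeta_min = min(xbeta), xbeta_max = max(xbeta),
            xbeta_median = median(xbeta), xbeta_mean = mean(xbeta),
            GB_ODDS = GOOD / BAD, LN_GB_ODDS = log(GB_ODDS)) %>%
  mutate(cummu_good = cumsum(GOOD),      # summarise drops the grouping,
         cummu_bad = cumsum(BAD),        # so these run over all rows
         cummu_n = cumsum(n),
         cummu_good_pct = cummu_good / sum(GOOD),
         cummu_bad_pct = cummu_bad / sum(BAD))
```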