0

Being a beginner-level user of R, despite having read (1) numerous posts about binning&grouping here at SO, and (2) documentation on data.table and dplyr packages, I still can't figure out how to apply the power of those packages for binning continuous&factor variables, for further use in credit scoring modelling.

Problem: To build a code-efficient, easily-customisable, more or less automated solution for binning variables with minimal hard-coding.

These variables used to be binned with a stored procedure (Oracle), but I would like to switch entirely to R and operate the following dataframes:

to bin variables in "df_Raw" according to variable/bin ranges&levels in "binsDF" and to store binned variables in "df_Binned" . So far I have been able to produce simple straight-forward code that is lengthy, error-prone (cut levels and labels are hard-coded), difficult to rollback and is just ugly; although it works.

The goal is to have this binning operation automated as much as possible with minimal hard-coding, so that to re-do binning it would take to update bin ranges&levels in "binsDF" and to re-run the code rather than to manually edit all hard code.

I wonder how **ply family of functions and dplyr functions could be applied nicely to this problem.

Data description - the datasets have 100+ variables and 1-2 mln observations, with two types of variables to be binned:

  1. Continuous variables. Example - OVERDUEAMOUNT - has values 0 (zero), "NA", and both negative&positive numeric values.

OVERDUEAMOUNT needs to be split into 7 bins: bin#1 contains only zeros, bins#2-6 contain continuous values that need to be split into 5 custom-sized intervals, and bin#7 contains only NAs.

  1. Factor variables, with both character and numeric values. Example - PROFESSION - has 4 levels: "NA" and 3 values/codes that stand for certain categories of professions/types of employment.

It is important to place zeros and NAs in 2 separate bins, as they usually have very different interpretation from each other and from other values.

Datasets like iris or GermanCredit are not applicable due to not having NAs, strings or zeros, so I wrote some code below to replicate my data. Many thanks in advance!

Raw data to be binned.

OVERDUEAMOUNT_numbers <- rnorm(10000, mean = 9000, sd = 3000)
OVERDUEAMOUNT_zeros <- rep(0, 3000)
OVERDUEAMOUNT_NAs <- rep(NA, 4000)
OVERDUEAMOUNT <- c(OVERDUEAMOUNT_numbers, OVERDUEAMOUNT_zeros, OVERDUEAMOUNT_NAs)

PROFESSION_f1 <- rep("438", 3000) 
PROFESSION_f2 <- rep("000", 4000)
PROFESSION_f3 <- rep("selfemployed", 5000)
PROFESSION_f4 <- rep(NA, 5000)
PROFESSION <- c(PROFESSION_f1, PROFESSION_f2, PROFESSION_f3, PROFESSION_f4)

ID <- sample(123456789:987654321, 17000, replace = TRUE); n_distinct(ID)

df_Raw <- cbind.data.frame(ID, OVERDUEAMOUNT, PROFESSION) 
colnames(df_Raw) <- c("ID", "OVERDUEAMOUNT", "PROFESSION")

Convert PROFESSION to factor to replicate this variable is processed & prepared for further import into R. Reshuffle the dataframe row-wise to make it look like real data.

df_Raw$PROFESSION <- as.factor(df_Raw$PROFESSION) 
df_Raw <- df_Raw[sample(nrow(df_Raw)), ]

Dataframe with bins.

variable <- c(rep("OVERDUEAMOUNT", 7), rep("PROFESSION", 4))
min <- c(0, c(-Inf, 1500, 4000, 8000, 12000), "", c("438", "000", "selfemployed", ""))
max <- c(0, c(1500, 4000, 8000, 12000, Inf), "", c("438", "000", "selfemployed", ""))
bin <- c(c(1, 2, 3, 4, 5, 6, 7), c(1, 2, 3, 4))

binsDF <- cbind.data.frame(variable, min, max, bin)
colnames(binsDF) <- c("variable", "min", "max", "bin")

How I bin the variables: copy the list of IDs "as is" in a separate dataframe for further use in semi-joins as a "reference/standard" for the original list of IDs.

dfID <- as.data.frame(df_Raw$ID); colnames(dfID) <- c("ID")

Continuous variable - OVERDUEAMOUNT. Split the variable into 3 temporary dataframes: zeros, NAs and numeric observations to cut.

df_tmp_zeros <- subset(x=df_Raw, subset=(OVERDUEAMOUNT == 0), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_zeros)
df_tmp_NAs <- subset(x=df_Raw, subset=(is.na(OVERDUEAMOUNT)), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_NAs)
df_tmp_numbers <- subset(x=df_Raw, subset=(OVERDUEAMOUNT != 0 & OVERDUEAMOUNT != is.na(OVERDUEAMOUNT)), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_numbers)
(nrow(df_tmp_zeros) + nrow(df_tmp_NAs) + nrow(df_tmp_numbers)) == nrow(df_Raw) # double-check that all observations are split into 3 parts.

Replace zeros and NAs with an appropriate bin numbers. Specify number of intervals, interval ranges and partition numeric values into bins. Cut the variable into intervals. Merge 3 parts together. Append the binned variable to the final dataframe.

df_tmp_zeros$OVERDUEAMOUNT <- as.factor(1)
df_tmp_NAs$OVERDUEAMOUNT <- as.factor(7)
cuts.OVERDUEAMOUNT <- c(-Inf, 1500, 4000, 8000, 12000, Inf) 
labels.OVERDUEAMOUNT <- c(2:6) 
df_tmp_numbers$OVERDUEAMOUNT <- cut(df_tmp_numbers$OVERDUEAMOUNT, breaks = cuts.OVERDUEAMOUNT, labels = labels.OVERDUEAMOUNT, right = FALSE)

df_tmp_allback <- rbind(df_tmp_zeros, df_tmp_NAs, df_tmp_numbers)
nrow(df_tmp_allback) == nrow(df_Raw) # double-check that all observations are added back. 

df_semijoin <- semi_join(x=df_tmp_allback, y=dfID, by=c("ID")) # return all rows from x where there are matching values in y, keeping just columns from x.
glimpse(df_semijoin); summary(df_semijoin)

df_Binned <- df_semijoin
str(df_Binned)

Factor variable - PROFESSION. Split the variable into several temporary dataframes: NAs and as many parts as there are other factor levels.

df_tmp_f1 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "438"), select=c(ID, PROFESSION)); nrow(df_tmp_f1)
df_tmp_f2 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "000"), select=c(ID, PROFESSION)); nrow(df_tmp_f2)
df_tmp_f3 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "selfemployed"), select=c(ID, PROFESSION)); nrow(df_tmp_f3)
df_tmp_NAs <- subset(x=df_Raw, subset=(is.na(PROFESSION)), select=c(ID, PROFESSION)); nrow(df_tmp_NAs)

df_tmp_f1$PROFESSION <- as.factor(1)
df_tmp_f2$PROFESSION <- as.factor(2)
df_tmp_f3$PROFESSION <- as.factor(3)
df_tmp_NAs$PROFESSION <- as.factor(4)

df_tmp_allback <- rbind(df_tmp_f1, df_tmp_f2, df_tmp_f3, df_tmp_NAs)
nrow(df_tmp_allback) == nrow(df_Raw) # double-check that all observations are added back. 

df_semijoin <- semi_join(x=df_tmp_allback, y=dfID, by=c("ID")) # return all rows from x where there are matching values in y, keeping just columns from x.
str(df_semijoin); summary(df_semijoin)

df_Binned <- cbind(df_Binned, df_semijoin$PROFESSION)
str(df_Binned)

And so on...

P.S. UPDATE: The best solution to this problem is given in this post. roll join with start/end window

These posts are also helpful: How to join (merge) data frames (inner, outer, left, right)? Why does X[Y] join of data.tables not allow a full outer join, or a left join?

The idea is as follows: make a subset from the dataframe with raw data (1 column with unique IDs, 1 column with raw data (values of the variable) and 1 column with the name of variable (use rep() to repeat the variable name as many times are there are observations of the variable; then make a subset from the dataframe with bins with just one variable (as many rows as many bins of that particular variable), and in my case 4 columns - Variable, Min, Max, Bin. See sample code below:

Also I tried foverlaps() from data.table package, but it can't handle NAs; processing of NAs has to be done separately AFAIU; another solution is to use rolling joins but I haven't cracked that yet. Will appreciate advice with the rolling joins.

# Subset numeric variables by variable name. 

rawDF_num_X <- cbind(rawDF2bin$ID, 
                 rep(var_num, times = nrow(rawDF2bin[, vars_num_raw][var_num])), 
                 rawDF2bin[, vars_num_raw][var_num])
colnames(rawDF_num_X) <- c("ID", "Variable", "Value")
rawDF_num_X <- as.data.table(rawDF_num_X)

# Subset table with bins for numeric variables by variable name. 

bins_num_X <- bins_num[bins_num$Variable == var_num, ] 
bins_num_X <- arrange(bins_num_X, Bin) # sort by bin values, in ascending order. 
bins_num_X <- as.data.table(bins_num_X)

# Select and join numeric variables with their corresponding bins using sqldf package. 

vars_num_join <- sqldf("SELECT a.ID, a.Variable, a.Value, b.Min, b.Max, b.Bin 
                        FROM rawDF_num_X AS a, bins_num_X AS b 
                        WHERE a.Variable = b.Variable AND a.Value between b.Min and b.Max
                        OR a.Value IS NULL AND b.Min IS NULL AND b.Max IS NULL") 
View(vars_num_join); dim(vars_num_join)

# Create a TRUE/FALSE flag/check according to the binning conditions. 

vars_num_join$check <- ifelse((is.na(vars_num_join$Value)= TRUE & is.na(vars_num_join$Min) == TRUE & is.na(vars_num_join$Max) == TRUE), "TRUE", 
                               ifelse((vars_num_join$Value == 0 & vars_num_join$Min == 0 & vars_num_join$Max == 0), "TRUE", 
                                      ifelse((vars_num_join$Value != 0 & vars_num_join$Value >= vars_num_join$Min & vars_num_join$Value < vars_num_join$Max), "TRUE", "FALSE")))

# Remove (duplicate) rows that have FALSE flag due to not matching the binning conditions. 

vars_num_join <- vars_num_join[vars_num_join$check == TRUE, ]
identical(rawDF2bin$ID, vars_num_join$ID) # should be TRUE 
Community
  • 1
  • 1
Aktan
  • 11
  • 2
  • 5
  • 1
    Is this really a minimal question/example? – talat Jan 22 '16 at 09:25
  • Sorry, couldn't make reproducible data any shorter. Half is input data, half is my clumsy solution. – Aktan Jan 22 '16 at 09:48
  • @docendo discimus, pls see my previous comment about lengthy post, and another question: how can one use data.table::foverlaps to tackle this task? – Aktan Jan 23 '16 at 07:42

0 Answers0