
Are there any ready-to-use libraries or packages for Python or R that reduce the number of levels of large categorical factors?

I want to achieve something similar to R: "Binning" categorical variables, but encode into the top-k most frequent factors plus "other".

Georg Heiler
  • You mean replace all "not-frequent" levels as "other"? – s_baldur Aug 21 '16 at 16:46
  • Yes, that is another way to phrase it, because otherwise, with several of these high-level categorical variables, my data matrix blows up in the case of one-hot encoding. – Georg Heiler Aug 21 '16 at 16:51
  • Check this [link](http://stackoverflow.com/questions/38788682/collapsing-factor-level-for-all-the-factor-variable-in-dataframe-based-on-the-co) – Chirayu Chamoli Aug 21 '16 at 17:48
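For reference, the top-k-plus-"other" idea from the question can be sketched in plain Python using only the standard library (the helper name lump_top_k is made up for illustration, not from any package):

```python
from collections import Counter

def lump_top_k(values, k, other="other"):
    """Keep the k most frequent levels; recode everything else as `other`."""
    top = {level for level, _ in Counter(values).most_common(k)}
    return [v if v in top else other for v in values]

data = ["A", "A", "A", "B", "B", "C", "D"]
print(lump_top_k(data, k=2))
# ['A', 'A', 'A', 'B', 'B', 'other', 'other']
```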

4 Answers


The R package forcats has fct_lump() for this purpose.

library(forcats)
fct_lump(f, n)

Here f is the factor and n is the number of most common levels to preserve; all remaining levels are recoded to Other.

Joe

Here is an example in R that uses data.table a bit, but it should be easy to do without data.table as well.

# Load data.table
require(data.table)

# Some data
set.seed(1)
dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = TRUE)),
                 weight = rnorm(n = 10e3, mean = 70, sd = 20))

# Decide the minimum frequency a level needs...
min.freq <- 3350

# Levels that don't meet the minimum frequency (using data.table)
fail.min.f <- dt[, .N, type][N < min.freq, as.character(type)]

# Recode all of these levels as "Other"
# (match by level name rather than relying on factor integer codes)
levels(dt$type)[levels(dt$type) %in% fail.min.f] <- "Other"
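A rough Python analogue of the frequency-threshold step above, using only the standard library (the helper name lump_below is made up):

```python
from collections import Counter

def lump_below(values, min_freq, other="Other"):
    """Recode any level whose count falls below min_freq as `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_freq else other for v in values]

data = ["A"] * 5 + ["B"] * 3 + ["C"]
print(lump_below(data, min_freq=3))
# ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'Other']
```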
s_baldur
  • Thanks a lot - but why does it no longer work if wrapped in a function call like: reduceCategorical <- function(variableName, min.freq){ # Levels that don't meet minimum frequency (using data.table) fail.min.f <- neverData[, .N, variableName][N < min.freq, variableName] # Recode all these levels as "Other" levels(neverData[, variableName][fail.min.f]) <- "Other" } The error is: number of levels differs – Georg Heiler Aug 21 '16 at 21:08
  • I couldn't figure it out either. Will keep it at the back of my head until I have more time. Maybe the answer is here: http://stackoverflow.com/questions/11859063/data-table-and-get-command-r?noredirect=1&lq=1 – s_baldur Aug 21 '16 at 21:40
  • Thanks for your help. I raised a separate question for this problem here: http://stackoverflow.com/questions/39071715/r-data-table-usage-in-function-call – Georg Heiler Aug 22 '16 at 05:05

I do not think you want to do it this way. Grouping many levels into one group can make that feature less predictive. What you want to do instead is put all the levels that would go into "Other" into clusters based on a similarity metric. Some of them might cluster with your top-k levels, and some might cluster together, giving the best performance.

I had a similar issue and ended up answering it myself here. For my similarity metric, I used the proximity matrix from a random forest regression fit on all features except that one. The difference in my solution is that some of my top-k most common levels may be clustered together, since I use k-medoids to cluster. You would want to alter the clustering algorithm so that your medoids are the top-k levels you have chosen.
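A much-simplified stand-in for this idea in Python, assuming similarity is approximated by each rare level's mean target value rather than a random-forest proximity matrix (the function name, the contiguous-bucket grouping, and the "rare_i" labels are illustrative, not the method described above):

```python
from collections import Counter, defaultdict

def group_rare_by_target(categories, target, top_k, n_groups=2):
    """Keep the top_k most frequent levels; split the rare levels into
    n_groups buckets of levels with similar mean target values."""
    counts = Counter(categories)
    keep = {lvl for lvl, _ in counts.most_common(top_k)}
    # Mean target value per rare level
    sums, ns = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, target):
        if c not in keep:
            sums[c] += y
            ns[c] += 1
    means = {c: sums[c] / ns[c] for c in sums}
    # Sort rare levels by mean target and cut into n_groups contiguous buckets
    ordered = sorted(means, key=means.get)
    size = max(1, -(-len(ordered) // n_groups))  # ceiling division
    mapping = {c: f"rare_{i // size}" for i, c in enumerate(ordered)}
    return [c if c in keep else mapping[c] for c in categories]

cats = ["A", "A", "A", "B", "B", "C", "D", "E"]
ys = [1, 1, 1, 2, 2, 10, 11, 50]
print(group_rare_by_target(cats, ys, top_k=2))
# ['A', 'A', 'A', 'B', 'B', 'rare_0', 'rare_0', 'rare_1']
```

C and D end up in one bucket because their mean targets (10 and 11) are close, while E (mean 50) lands in its own bucket.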

Keith
  • Interesting approach. From what I have learnt meanwhile, I believe that contrast coding http://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/ is a better fit. – Georg Heiler Apr 27 '17 at 05:01
  • I thought contrast coding only allows the categoricals to be entered into a model. It just converts 1 feature with k levels to k-1 features with 2 levels. I have only used dummy coding, am I missing something? – Keith Apr 27 '17 at 05:34
  • That is true for binary dummy coding. But as the linked website shows, there are many more possibilities. Some methods, e.g. let's call this one percentage coding, calculate group-/level-wise percentages or some other function to turn each level into a numeric value measured from the data. This does not necessarily result in more columns, unlike dummy coding. – Georg Heiler Apr 27 '17 at 05:39
  • OK got it. So it is more of a method for approximating an order and distance between levels. Thanks – Keith Apr 27 '17 at 05:43
  • Please see http://nipy.bic.berkeley.edu/nightly/statsmodels/doc/html/contrasts.html, which notes: "In fact, the dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of contrasts for a categorical variable with k levels is a set of k-1 functionally independent linear combinations of the factor levels." – Georg Heiler May 04 '17 at 04:23
  • Thanks, I'll look into it more. I use k dummies where {0,0,0,...} is the NULL. My regression model is a boosted decision tree, so I think in the end it does not matter, since it is fundamentally built on binary choices. – Keith May 04 '17 at 19:36
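The "percentage coding" idea from this comment thread can be sketched in Python roughly as mean-target encoding, which yields one numeric column instead of k-1 dummies (target_encode is a hypothetical name, not a library function):

```python
from collections import defaultdict

def target_encode(categories, target):
    """Replace each level with the mean of the target within that level."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, target):
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

print(target_encode(["A", "A", "B"], [1, 0, 1]))
# [0.5, 0.5, 1.0]
```

In practice this encoding should be computed on training data only (and ideally smoothed or cross-validated) to avoid leaking the target.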

Here's an approach using base R:

set.seed(123)
d <- data.frame(x = sample(LETTERS[1:5], 1e5, prob = c(.4, .3, .2, .05, .05), replace = TRUE))

recat <- function(x, new_cat, threshold) {
    x <- as.character(x)
    xt <- prop.table(table(x))
    factor(ifelse(x %in% names(xt)[xt >= threshold], x, new_cat))
}

d$new_cat <- recat(d$x, "O", 0.1)
table(d$new_cat)
#     A     B     C     O 
# 40132 29955 19974  9939 
Weihuang Wong