
I have an unbalanced data set with a categorical dependent variable and a mix of continuous and categorical feature variables. I know that the SMOTE function from the DMwR package can handle only continuous features. Is there a package that can handle both categorical and continuous features, as Chawla describes in his paper?

JasonAizkalns
  • Not in R, but it seems that it has been implemented in python. https://stackoverflow.com/questions/47655813/oversampling-smote-for-binary-and-categorical-data-in-python – RLave Mar 25 '19 at 14:07
  • My reading of the paper that you cite is that it covers only continuous features. In particular, note that when describing the "Adult" dataset, they wrote `For SMOTE, we extracted the continuous features and generated a new dataset with only continuous features.` – G5W Mar 25 '19 at 14:12
  • It's under section 6.1 and 6.2 @G5W. But just in theory. – RLave Mar 25 '19 at 14:13
  • @RLave Yes, I know, but I hope someone can help me out with an R implementation – MasterStudent1992 Mar 25 '19 at 14:16
  • Unfortunately, it seems that it's not yet implemented in R. – RLave Mar 25 '19 at 14:17

1 Answer


You can handle this in R!

Yes, both smotefamily::SMOTE and DMwR::SMOTE can only handle numeric features because the underlying algorithm is k-nearest neighbors.

Therefore:

  1. Convert all categorical variables to factors.

  2. Compute a numeric estimate for each factor level with the recent embed package (part of the tidymodels ecosystem).

The embed package offers three methods to perform step 2:

  • step_lencode_glm
  • step_lencode_bayes
  • step_lencode_mixed

The documentation says that these methods estimate the effect of each of the factor levels on the outcome and these estimates are used as the new encoding.
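A minimal sketch of the two steps followed by SMOTE, in case it helps. The toy data frame and its column names (`y`, `x_num`, `x_cat`) are made up for illustration, and I picked `step_lencode_glm` and `smotefamily::SMOTE` only as one possible combination; treat it as a starting point under those assumptions, not a reference implementation.

```r
library(recipes)      # recipe(), prep(), bake()
library(embed)        # step_lencode_glm()
library(smotefamily)  # SMOTE()

set.seed(42)

# Hypothetical toy data: imbalanced binary outcome,
# one continuous and one categorical feature.
n <- 200
toy <- data.frame(
  y     = factor(c(rep("minority", 20), rep("majority", 180))),
  x_num = rnorm(n),
  x_cat = factor(sample(c("a", "b", "c"), n, replace = TRUE))  # step 1: factor
)

# Step 2: replace each factor level with its estimated effect on the outcome.
rec <- recipe(y ~ ., data = toy)
rec <- step_lencode_glm(rec, x_cat, outcome = dplyr::vars(y))
encoded <- bake(prep(rec, training = toy), new_data = NULL)

# All predictors are numeric now, so the k-nearest-neighbors step can run.
features <- as.data.frame(encoded[, setdiff(names(encoded), "y")])
smoted <- SMOTE(X = features, target = as.character(encoded$y), K = 5)

# smoted$data holds the original plus synthetic rows;
# the outcome is stored in the column named "class".
table(smoted$data$class)
```

Two things to keep in mind: the encoding uses the outcome, so it should be fitted on the training split only, and the synthetic rows get interpolated values for `x_cat` that no longer correspond to one of the original levels.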

Agile Bean