I have an unbalanced data set with a categorical dependent variable and feature variables that are continuous and categorical. I know that the SMOTE function from the DMwR package can handle only continuous features. Is there a package that can handle both categorical and continuous features, as Chawla describes in his paper?
- Not in R, but it seems that it has been implemented in Python: https://stackoverflow.com/questions/47655813/oversampling-smote-for-binary-and-categorical-data-in-python – RLave Mar 25 '19 at 14:07
- My reading of the paper that you cite is that it covers only continuous features. In particular, note that when describing the "Adult" dataset, they wrote `For SMOTE, we extracted the continuous features and generated a new dataset with only continuous features.` – G5W Mar 25 '19 at 14:12
- It's under sections 6.1 and 6.2, @G5W. But just in theory. – RLave Mar 25 '19 at 14:13
- @RLave Yes, I know, but I hope someone can help me out with an R implementation. – MasterStudent1992 Mar 25 '19 at 14:16
- Unfortunately, it seems that it's not yet implemented in R. – RLave Mar 25 '19 at 14:17
1 Answer
You can handle this in R!
Yes, both smotefamily::SMOTE and DMwR::SMOTE can only handle numeric features because the underlying algorithm is k-nearest neighbors.
Therefore:
1. Convert all categorical variables to the datatype factor.
2. Calculate numeric estimates of each factor level with the very recent embed package (part of tidymodels).
The embed package offers three methods to perform step 2:
- step_lencode_glm
- step_lencode_bayes
- step_lencode_mixed
The documentation says that these methods "estimate the effect of each of the factor levels on the outcome and these estimates are used as the new encoding."
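The two steps could be sketched roughly as follows. This is only an illustration, assuming the recipes, embed, dplyr and smotefamily packages are installed; the toy data frame and its column names (y, age, color) are made up and are not from the question.

```r
library(dplyr)        # %>% and vars()
library(recipes)      # recipe(), prep(), bake()
library(embed)        # step_lencode_glm()
library(smotefamily)  # SMOTE()

set.seed(42)

# Hypothetical toy data: imbalanced binary outcome,
# one continuous feature and one categorical feature.
n <- 500
toy <- data.frame(
  y     = factor(sample(c("major", "minor"), n, replace = TRUE, prob = c(0.9, 0.1))),
  age   = rnorm(n, mean = 40, sd = 10),
  color = factor(sample(c("red", "green", "blue"), n, replace = TRUE))
)

# Step 1: `color` is already stored as a factor.
# Step 2: replace each factor level with a GLM-based effect estimate.
rec <- recipe(y ~ ., data = toy) %>%
  step_lencode_glm(color, outcome = vars(y)) %>%
  prep(training = toy)

encoded <- bake(rec, new_data = NULL)

# Every predictor is numeric now, so the nearest-neighbour search in SMOTE can run.
X   <- as.data.frame(encoded[, c("age", "color")])
res <- SMOTE(X, target = encoded$y, K = 5)

# Class counts after oversampling; smotefamily stores the label in a column named `class`.
table(res$data$class)
```

Note that SMOTE interpolates between neighbours, so synthetic rows can carry encoded values for color that lie between the estimates of the original levels and do not map back to a single level.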

Agile Bean