I'm using SMOTE from imblearn.over_sampling, and while generating synthetic data to test it, I see a sharp cutoff at which SMOTE breaks. When I generate data with the following code (courtesy of Jason Brownlee) and run it through SMOTE:
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
# 10,000 samples, 15 features, 99% of samples in the majority class
X, y = make_classification(n_samples=10000, n_features=15, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
It works fine. However, when the number of features is 16...
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
# Same dataset, but with 16 features instead of 15
X, y = make_classification(n_samples=10000, n_features=16, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
SMOTE breaks. Why is this? Does anyone know of a SMOTE method that works with more than 15 features? By "breaks", I mean I get the error below:
Traceback (most recent call last):
  File "\\arete\shared\Los Angeles\Users\Active\bbonifacio\New ADVANCE\untitled1.py", line 13, in <module>
    X, y = oversample.fit_resample(X, y)
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\imblearn\base.py", line 83, in fit_resample
    output = self._fit_resample(X, y)
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\imblearn\over_sampling\_smote\base.py", line 324, in _fit_resample
    nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\sklearn\neighbors\_base.py", line 763, in kneighbors
    results = PairwiseDistancesArgKmin.compute(
  File "sklearn\metrics\_pairwise_distances_reduction.pyx", line 691, in sklearn.metrics._pairwise_distances_reduction.PairwiseDistancesArgKmin.compute
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 151, in threadpool_limits
    return threadpoolctl.threadpool_limits(limits=limits, user_api=user_api)
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\threadpoolctl.py", line 171, in __init__
    self._original_info = self._set_threadpool_limits()
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\threadpoolctl.py", line 268, in _set_threadpool_limits
    modules = _ThreadpoolInfo(prefixes=self._prefixes,
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\threadpoolctl.py", line 340, in __init__
    self._load_modules()
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\threadpoolctl.py", line 373, in _load_modules
    self._find_modules_with_enum_process_module_ex()
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\threadpoolctl.py", line 485, in _find_modules_with_enum_process_module_ex
    self._make_module_from_path(filepath)
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\threadpoolctl.py", line 515, in _make_module_from_path
    module = module_class(filepath, prefix, user_api, internal_api)
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\threadpoolctl.py", line 606, in __init__
    self.version = self.get_version()
  File "C:\Users\bbonifacio\Anaconda3\lib\site-packages\threadpoolctl.py", line 646, in get_version
    config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
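A minimal sketch for pinning down the cutoff, assuming the same dataset settings as in the two snippets above. The helper names run_smote and first_failing_feature_count are hypothetical and not part of imblearn or scikit-learn; the scan simply reports the first feature count at which any exception is raised, along with that exception.

```python
def run_smote(n_features):
    """Build the imbalanced dataset used above and resample it with SMOTE."""
    # Imported lazily so the scanning helper below works without these packages.
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=10000, n_features=n_features,
                               n_redundant=0, n_clusters_per_class=1,
                               weights=[0.99], flip_y=0, random_state=1)
    SMOTE().fit_resample(X, y)


def first_failing_feature_count(run, feature_counts):
    """Return (count, exception) for the first count where run(count) raises.

    Returns None if every count in feature_counts runs without an exception.
    """
    for n in feature_counts:
        try:
            run(n)
        except Exception as exc:  # deliberately broad: any failure is of interest
            return n, exc
    return None


# Example call (hypothetical range of feature counts):
# first_failing_feature_count(run_smote, range(10, 21))
```

Separating the scan from the SMOTE call keeps the helper reusable: any other resampler can be swapped in by passing a different callable.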
Here are the package versions:
scikit-learn (sklearn): 1.1.1
imbalanced-learn (imblearn): 0.9.1
threadpoolctl: 2.1.0
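The versions above can be collected from the running environment with a short script. This is a sketch using only the standard library (Python 3.8+); the helper name installed_versions is hypothetical, and the distribution names passed to it are the PyPI names of the three packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(["scikit-learn", "imbalanced-learn", "threadpoolctl"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

Running it inside the same interpreter that raises the error (here, the Anaconda environment shown in the traceback paths) ensures the reported versions match the ones actually being imported.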