Questions tagged [smote]

SMOTE is an abbreviation for Synthetic Minority Oversampling TEchnique. This tag refers to the oversampling method commonly used in machine learning to balance class distributions in datasets by introducing new, synthetic minority-class examples.

In machine learning, most classifiers work on the assumption that the classes in the training set are roughly balanced. When classes are imbalanced, classifiers tend to favor the majority class.

One way to overcome this is to interpolate between neighboring minority-class instances and so generate artificial samples.
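The interpolation idea can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the reference implementation or the imbalanced-learn API: the function name smote_sketch, its parameters, and the toy data are assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority-class neighbors.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base` (Euclidean distance),
        # excluding `base` itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority class with two features per sample (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Each synthetic point lies on the line segment between two existing minority samples, so it stays inside the region the minority class already occupies. In practice, the questions under this tag mostly use the SMOTE class from imblearn.over_sampling and its fit_resample method rather than a hand-written version like this one.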

Useful references:

One of the earlier publications on SMOTE: Chawla et al. 2002

One review on SMOTE: Fernández et al 2017

Influence of datasets on SMOTE: Skryjomski et al 2017

Python toolbox for imbalanced datasets: Lemaître et al 2017

185 questions
11
votes
1 answer

How to split data based on a column value in sklearn

I have a data file with following columns 'customer', 'calibrat' - Calibration sample = 1; Validation sample = 0; 'churn', 'churndep', 'revenue', 'mou', Data file contains some 40000 rows out of which 20000 have value for calibrat as 1. I want to…
6
votes
4 answers

Getting error: KeyError: 'Only the Series name can be used for the key in Series dtype mappings.' when trying to do pandas Smote algorithm

My data is slightly unbalanced, so I am trying to do a SMOTE algorithm before doing the logistic regression model. When I do, I get the error: KeyError: 'Only the Series name can be used for the key in Series dtype mappings.' Could someone help me…
devdon
5
votes
3 answers

package to do SMOTE in R

I am trying to do SMOTE in R for imbalanced datasets. I tried installing the "DMwR" package for this, but it seems this package has been removed from the CRAN repository. I am getting the error: "package ‘DMwR’ is not available (for R version 4.0.2)…
Triparna Poddar
5
votes
2 answers

Retain pandas dataframe structure after SMOTE, oversampling in python

Problem: While implementing SMOTE (a type of oversampling), my dataframe is getting converted to a numpy array. Test_train_split from sklearn.model_selection import train_test_split X_train, X_test, y_train_full, y_test_full = train_test_split(X,…
noob
5
votes
1 answer

SMOTE function not working in make_pipeline

I want to apply cross-validation and over-sampling simultaneously. I get the following error from this code: from sklearn.pipeline import Pipeline, make_pipeline imba_pipeline = make_pipeline(SMOTE(random_state=42), …
4
votes
1 answer

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']

I already referred to the posts here, here and here. Don't mark it as duplicate. I am working on a binary classification problem where my dataset has categorical and numerical columns. However, some of the categorical columns have a mix of numeric and…
The Great
4
votes
1 answer

How can I use SMOTE in a Sklearn Pipeline for a NLP Classification problem?

I'm dealing with a multiclass classification problem, in which some classes are very imbalanced. My data looks like this: product_description class "This should be used to clean..." 1 "Beauty product, natural..." …
dekio
4
votes
3 answers

SMOTE - could not convert string to float

I think I'm missing something in the code below. from sklearn.model_selection import train_test_split from imblearn.over_sampling import SMOTE # Split into training and test sets # Testing Count Vectorizer X = df[['Spam']] y =…
Math
4
votes
1 answer

TypeError: __init__() got an unexpected keyword argument 'ratio' when using SMOTE

I am using SMOTE to oversample as my dataset is imbalanced. I am getting an unexpected argument error. But in the documentation, the ratio argument is defined for SMOTE. Can someone help me understand where I am going wrong? Code snippet from…
anushiya-thevapalan
4
votes
2 answers

SMOTE with multiple bert inputs

I'm building a multiclass text classification model using Keras and Bert (HuggingFace), but I have a very imbalanced dataset. I've used SMOTE from Sklearn in order to generate additional samples for the underbalanced classes (I have 45 in total),…
ML_Engine
3
votes
1 answer

Why does SMOTE not work with more than 15 features / What method does work with more than 15 features?

I'm currently implementing machine learning using SMOTE from imblearn.over_sampling, and as I'm synthesizing data for it, I see a very noticeable cutoff for when the SMOTE method breaks. When I synthesize data using the following code and run it…
3
votes
7 answers

Cannot import name 'available_if' from 'sklearn.utils.metaestimators'

While importing "from imblearn.over_sampling import SMOTE", I am getting an import error. Please check and help. I tried upgrading sklearn, but the upgrade was undone with 'OSError'. Firstly installed imbalanced-learn through pip. !pip install -U…
Piyush
3
votes
2 answers

Oversampling a sparse dataset in Python

I have a dataset that has multi-labeled data. There is a total of 20 labels (from 0 to 20) which have an imbalanced distribution among them. Here is an overview of the data: |id |label|value | |-----|-----|------------| |95534|0 …
LoneWolf
3
votes
0 answers

Python - How to differentiate SMOTE resampling from original data

I oversampled my data using SMOTE like so: >>> from imblearn.over_sampling import SMOTE >>> X_resampled, y_resampled = SMOTE().fit_resample(X, y) So now X_resampled, y_resampled are larger than the original data set. How can I tell apart the…
Shlomi Schwartz
3
votes
2 answers

How do we set ratio in SMOTE to have more positive sample than negative sample?

I am trying to use SMOTE to handle imbalanced class data in binary classification, and what I know is: if we use, for example sm = SMOTE(ratio = 1.0, random_state=10) Before OverSampling, counts of label '1': [78] Before OverSampling, counts of…
npm