Questions tagged [oversampling]

Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented).
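
As a rough illustration of the definition above, here is a minimal sketch of random oversampling and undersampling using only the Python standard library. The helper name `resample_to`, the `target` parameter, and the toy data are assumptions made for this example, not part of any library. Smaller classes are padded with random duplicates and larger classes are cut down to a random subset.

```python
import random
from collections import defaultdict


def resample_to(rows, label_of, target, seed=0):
    """Return a new list in which every class has exactly `target` rows.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    balanced = []
    for members in by_class.values():
        if len(members) >= target:
            # Undersample: keep a random subset, without replacement.
            balanced.extend(rng.sample(members, target))
        else:
            # Oversample: keep all originals, then add random duplicates.
            extra = rng.choices(members, k=target - len(members))
            balanced.extend(members + extra)

    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 8 rows of class 0 and 2 rows of class 1.
data = [(i, 0) for i in range(8)] + [(i, 1) for i in range(8, 10)]
balanced = resample_to(data, label_of=lambda row: row[1], target=5)
counts = {c: sum(1 for row in balanced if row[1] == c) for c in (0, 1)}
print(counts)  # {0: 5, 1: 5}
```

Resampling like this is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.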

156 questions
35
votes
1 answer

Using SMOTE with GridSearchCV in scikit-learn

I'm dealing with an imbalanced dataset and want to do a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data, I want to use SMOTE, and I know I can include that as a stage of a pipeline and pass it to…
16
votes
5 answers

SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors

I have already pre-cleaned the data, and below shows the format of the top 4 rows: [IN] df.head() [OUT] Year cleaned 0 1909 acquaint hous receiv follow letter clerk crown... 1 1909 ask secretari state war…
Dbercules
  • 629
  • 1
  • 9
  • 26
11
votes
1 answer

Duplicating training examples to handle class imbalance in a pandas data frame

I have a DataFrame in pandas that contains training examples, for example:

   feature1  feature2  class
0  0.548814  0.791725      1
1  0.715189  0.528895      0
2  0.602763  0.568045      0
3  0.544883  0.925597      0
4  0.423655  0.071036      0
5…
Franck Dernoncourt
  • 77,520
  • 72
  • 342
  • 501
10
votes
3 answers

Use SMOTE to oversample image data

I'm doing binary classification with CNNs, and the data is imbalanced: the ratio of positive to negative medical images is 0.4 : 0.6. So I want to use SMOTE to oversample the positive medical image data before training. However, the…
9
votes
2 answers

Weighted random sampler - oversample or undersample?

Problem: I am training a deep learning model in PyTorch for binary classification, and I have a dataset containing unbalanced class proportions. My minority class makes up about 10% of the given observations. To avoid the model learning to just…
clueless
  • 211
  • 2
  • 3
  • 7
6
votes
3 answers

Using sklearn.train_test_split for imbalanced data

I have a very imbalanced dataset. I used the sklearn.train_test_split function to extract the training set. Now I want to oversample the training set, so I counted the number of type1 (my dataset has 2 categories and types (type1 and type2) but…
6
votes
2 answers

Oversampling or SMOTE in PySpark

I have 7 classes and 115 records in total, and I want to run a Random Forest model on this data. But the data is not enough to get high accuracy, so I want to apply oversampling to all the classes in a way that the majority…
Surbhi Jain
  • 107
  • 1
  • 2
  • 5
6
votes
1 answer

How to apply SMOTE technique (oversampling) before word embedding layer

How can I apply the SMOTE algorithm before the word embedding layer in an LSTM? I have a binary text classification problem (Good (9,500) or Bad (500) reviews, 10,000 training samples in total, so the training set is unbalanced). Meanwhile, I am using an LSTM with…
user1531248
  • 521
  • 1
  • 5
  • 17
5
votes
1 answer

SMOTE function not working in make_pipeline

I want to apply cross-validation and oversampling simultaneously. I get the following error from this code: from sklearn.pipeline import Pipeline, make_pipeline imba_pipeline = make_pipeline(SMOTE(random_state=42), …
5
votes
1 answer

Upsampling: insert extra values between each consecutive elements of a vector

Suppose we have a vector V consisting of 20 floating point numbers. Is it possible to insert values between each pair of these floating point numbers such that vector V becomes a vector of exactly 50 numbers? The inserted value should be a random number…
student_11
  • 142
  • 6
5
votes
1 answer

How to resample text (imbalanced groups) in a pipeline?

I'm trying to do some text classification using MultinomialNB, but I'm running into problems because my data is unbalanced. (Below is some sample data for simplicity. In actuality, mine is much larger.) I'm trying to resample my data using…
4
votes
1 answer

TypeError: __init__() got an unexpected keyword argument 'ratio' when using SMOTE

I am using SMOTE to oversample as my dataset is imbalanced. I am getting an unexpected argument error. But in the documentation, the ratio argument is defined for SMOTE. Can someone help me understand where I am going wrong? Code snippet from…
anushiya-thevapalan
  • 561
  • 3
  • 5
  • 16
3
votes
2 answers

Over and under sample multi-class training examples (rows) in a pandas dataframe to specified values

I would like to make a multi-class pandas dataframe more balanced for training. A simplified version of my training set looks as follows: Imbalanced dataframe: counts for class 0, 1 and 2 are respectively 7, 3 and 1 animal class 0 dog1 …
Simon
  • 33
  • 3
3
votes
3 answers

Imbalanced Image Dataset (Tensorflow2)

I'm working on a binary image classification problem, but the two classes (~590 and ~5900 instances for class 1 and 2, respectively) are heavily skewed, though still quite distinct. Is there any way I can fix this? I want to try SMOTE/random…
3
votes
2 answers

Oversampling a sparse dataset in Python

I have a dataset with multi-labeled data. There are 20 labels in total (from 0 to 20), with an imbalanced distribution among them. Here is an overview of the data: |id |label|value | |-----|-----|------------| |95534|0 …
LoneWolf
  • 79
  • 6