Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

351 questions
113 votes, 6 answers

scikit-learn .predict() default threshold

I'm working on a classification problem with unbalanced classes (5% 1's). I want to predict the class, not the probability. In a binary classification problem, is scikit's classifier.predict() using 0.5 by default? If it doesn't, what's the default…
ADJ
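
As background to the question above: for many scikit-learn classifiers that expose predict_proba, predict() in the binary case amounts to choosing the more probable class, which behaves like a cut-off of 0.5. The following is a minimal sketch of applying a different cut-off to positive-class probabilities. The helper `apply_threshold` and the sample probabilities are hypothetical and not part of scikit-learn.

```python
def apply_threshold(positive_probs, threshold=0.5):
    """Turn positive-class probabilities into 0/1 labels using a custom cut-off."""
    return [1 if p > threshold else 0 for p in positive_probs]


# Hypothetical probabilities, e.g. the positive-class column of clf.predict_proba(X)
probs = [0.03, 0.20, 0.45, 0.70]

default_labels = apply_threshold(probs)        # cut-off of 0.5
lowered_labels = apply_threshold(probs, 0.15)  # lower cut-off that favours the rare class
```

With a rare positive class (5% in the question), a lower cut-off usually trades precision for recall, so the value is best chosen on a validation set.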
18 votes, 7 answers

AttributeError: 'SMOTE' object has no attribute '_validate_data'

I'm resampling my data (multiclass) by using SMOTE. sm = SMOTE(random_state=1) X_res, Y_res = sm.fit_resample(X_train, Y_train) However, I'm getting this attribute error. Can anyone help?
HP_17
15 votes, 4 answers

No module named 'sklearn.neighbors._base'

I have recently installed imblearn package in jupyter using !pip show imbalanced-learn But I am not able to import this package. from tensorflow.keras import backend from imblearn.over_sampling import SMOTE I get the following…
joel
13 votes, 2 answers

XGBoost for multiclassification and imbalanced data

I am dealing with a classification problem with 3 classes [0,1,2], and imbalanced class distribution as shown below. I want to apply XGBClassifier (in Python) to this classification problem, but the model does not respond to class_weight…
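
One workaround often suggested for this situation is to pass per-sample weights to the model's fit method through its sample_weight argument instead of relying on class_weight. The sketch below computes such weights in plain Python with the "balanced" heuristic, n_samples / (n_classes * class_count). The helper name `balanced_sample_weights` and the label list are hypothetical.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    class_weight = {c: n_samples / (n_classes * n) for c, n in counts.items()}
    return [class_weight[y] for y in labels]


# Hypothetical 3-class labels with a 6:3:1 distribution
y = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_sample_weights(y)
# Each class now carries the same total weight, so rare classes count more per sample.
```

The resulting list could then be supplied as sample_weight when fitting, assuming the estimator accepts that argument.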
6 votes, 0 answers

Model Probability Calibration in Pyspark

I am using PySpark to implement a Churn classification model for a business problem and the dataset I have is imbalanced. So when I train the model, I randomly select a dataset with equal numbers of 1's and 0's. Then I applied the model in a…
ekorkmz
6 votes, 3 answers

using sklearn.train_test_split for Imbalanced data

I have a very imbalanced dataset. I used the sklearn.train_test_split function to extract the train dataset. Now I want to oversample the train dataset, so I used to count number of type1 (my data set has 2 categories and types (type1 and type2) but…
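
A common approach here is to split with stratification first (scikit-learn's train_test_split has a stratify parameter for this) and to oversample only the training portion afterwards. The plain-Python sketch below illustrates the idea of a stratified split. The function `stratified_split` and the 8:92 label list are hypothetical, not the asker's data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both parts."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 8 samples of type1 and 92 of type2
labels = ["type1"] * 8 + ["type2"] * 92
train_idx, test_idx = stratified_split(labels)
# Any oversampling would then be applied to the rows in train_idx only.
```

Keeping the test rows untouched means evaluation reflects the original class distribution.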
5 votes, 3 answers

package to do SMOTE in R

I am trying to do SMOTE in R for imbalanced datasets. I tried installing "DMwR" package for this, but it seems this package has been removed from the cran repository. I am getting the error:" package ‘DMwR’ is not available (for R version 4.0.2)…
Triparna Poddar
5 votes, 2 answers

how to handle unbalanced data for multilabel classification using CNN in Keras?

My dataset shape is (91149, 12). I used a CNN to train my classifier for a text classification task and got Training Accuracy: 0.5923 and Testing Accuracy: 0.5780. My class has 9 labels, as below: df['thematique'].value_counts() Corporate …
4 votes, 1 answer

Error : "Number of classes in y_true not equal to the number of columns in 'y_score'"

I have an imbalanced multiclass dataset. When I try to compute the roc_auc_score I get this error: ValueError: Number of classes in y_true not equal to the number of columns in 'y_score'. Here is the code: model = svm.SVC(kernel='linear',…
4 votes, 2 answers

Custom loss function (focal loss) input size error in Keras

I am using a neural network to do multi-class classification. There are 3 imbalanced classes, so I'd like to use the focal loss to handle the imbalance. So I use a custom loss function to fit in a Keras sequential model. I tried multiple versions of…
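
For reference, the focal loss for a single sample is usually written as -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below evaluates it in plain Python for one probability vector. It is an illustration of the formula only, not a Keras-compatible loss, and the function name and example probabilities are hypothetical.

```python
import math


def focal_loss(probs, true_class, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t)."""
    p_t = probs[true_class]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical softmax outputs over 3 classes, true class is index 0 in both cases
easy = focal_loss([0.90, 0.05, 0.05], true_class=0)  # confident and correct
hard = focal_loss([0.20, 0.50, 0.30], true_class=0)  # true class poorly predicted
# With gamma=0 and alpha=1 the expression reduces to ordinary cross-entropy.
```

The (1 - p_t)^gamma factor shrinks the contribution of well-classified samples, so training focuses on the harder, often minority-class, examples.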
4 votes, 0 answers

R-caret : how to use class weights along with downSample to deal with class imbalance issue?

I have a hugely imbalanced data set. To deal with this issue, I tried separately different class-imbalance techniques : downSample, class weights, threshold tuning. Among them, threshold tuning was the least effective. Using downSample alone or…
Basilique
4 votes, 2 answers

Multilabel classification with class imbalance in Pytorch

I have a multilabel classification problem, which I am trying to solve with CNNs in Pytorch. I have 80,000 training examples and 7900 classes; every example can belong to multiple classes at the same time, mean number of classes per example is 130.…
4 votes, 2 answers

Using imbalanced-learn with Pandas DataFrame

My dataset is quite imbalanced. The two minority classes each contain half of the sample in the majority class. My RNN model is not able to learn anything about the least populated class. I'm trying to use the imbalanced-learn library. For…
3 votes, 1 answer

Class_weight and sample_weight ineffective for sklearn Random Forest

I'm new to ML and I've been working with an imbalanced data set where the count of negative samples is twice that of the positive samples. In order to address this, I set scikit-learn Random Forest class_weight = 'balanced', which gave me an ROC-AUC…
RB10
3 votes, 1 answer

How to use "is_unbalance" and "scale_pos_weight" parameters in LightGBM for a binary classification project that is unbalanced (80:20)

I currently have an imbalanced dataset, as shown in the diagram below: Then, I use the 'is_unbalance' parameter by setting it to True when training the LightGBM model. Diagrams below show how I use this parameter. Example of using native API: Example…
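
A commonly used heuristic for scale_pos_weight is the ratio of negative to positive samples, and LightGBM's documentation notes that is_unbalance and scale_pos_weight should not be set at the same time. The sketch below computes that ratio for an 80:20 split. The helper name and the label list are hypothetical, and treating label 1 as the positive class is an assumption.

```python
def negative_to_positive_ratio(labels, positive=1):
    """Ratio of negative to positive samples, a common starting value for scale_pos_weight."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples in labels")
    return n_neg / n_pos


# Hypothetical labels with an 80:20 class distribution
y = [0] * 80 + [1] * 20
ratio = negative_to_positive_ratio(y)  # 4.0
```

The ratio is only a starting point. The value is typically tuned on validation data, since heavy reweighting can distort predicted probabilities.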