
I have to deal with a class-imbalance problem: binary classification of a test dataset where the majority class label in the training dataset is 1 (the other class label is 0).

For example, here is a small part of the training data:

93.65034,94.50283,94.6677,94.20174,94.93986,95.21071,1
94.13783,94.61797,94.50526,95.66091,95.99478,95.12608,1
94.0238,93.95445,94.77115,94.65469,95.08566,94.97906,1
94.36343,94.32839,95.33167,95.24738,94.57213,95.05634,1
94.5774,93.92291,94.96261,95.40926,95.97659,95.17691,0
93.76617,94.27253,94.38002,94.28448,94.19957,94.98924,0

where the last column is the class label, 0 or 1. The actual dataset is very skewed, with roughly a 10:1 class ratio: about 700 samples have 0 as their class label, while the remaining 6800 have 1.

The rows above are only a few of the samples in the given dataset; the full dataset contains about 90% samples with class label 1 and the rest with class label 0, even though more or less all the samples look very similar.

Which classifier would be best for handling this kind of dataset?

I have already tried logistic regression as well as an SVM with the `class_weight` parameter set to `"balanced"`, but got no significant improvement in accuracy.
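For reference, a minimal sketch of the setup I described, using the sample rows above with scikit-learn's logistic regression (the variable names are mine, not from my actual code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Six numeric features per row; the last column is the 0/1 class label.
rows = np.array([
    [93.65034, 94.50283, 94.66770, 94.20174, 94.93986, 95.21071, 1],
    [94.13783, 94.61797, 94.50526, 95.66091, 95.99478, 95.12608, 1],
    [94.02380, 93.95445, 94.77115, 94.65469, 95.08566, 94.97906, 1],
    [94.36343, 94.32839, 95.33167, 95.24738, 94.57213, 95.05634, 1],
    [94.57740, 93.92291, 94.96261, 95.40926, 95.97659, 95.17691, 0],
    [93.76617, 94.27253, 94.38002, 94.28448, 94.19957, 94.98924, 0],
])
X, y = rows[:, :-1], rows[:, -1]

# class_weight="balanced" reweights samples inversely to class frequency,
# so the minority class contributes as much to the loss as the majority.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
preds = clf.predict(X)
```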

Jarvis
  • Since this isn't a programming question you're going to get better responses over at [Cross Validated](http://stats.stackexchange.com/) – Tchotchke Sep 15 '16 at 15:39

2 Answers


but got no significant improvement in accuracy.

Accuracy isn't the way to go here (e.g. see the accuracy paradox). With a 10:1 class ratio you can easily get 90% accuracy just by always predicting class label 1.

Some good starting points are:

  • try a different performance metric, e.g. the F1-score or the Matthews correlation coefficient

  • "resample" the dataset: add examples from the under-represented class (over-sampling) / delete instances from the over-represented class (under-sampling; you should have a lot of data)

  • take a different point of view: anomaly detection is worth a try on an imbalanced dataset

  • a different algorithm is another possibility, but not a silver bullet. You could start with decision trees, which often perform well on imbalanced datasets


EDIT (now knowing you're using scikit-learn)

The weights from the `class_weight` parameter (scikit-learn) are used while training the classifier (so `balanced` is fine), but accuracy is a poor way to measure how well it performs.

The `sklearn.metrics` module implements several loss, score, and utility functions to measure classification performance. Also take a look at How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?.
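As a quick illustration of why accuracy misleads here (toy data, not yours): with a 9:1 class ratio, a classifier that always predicts the majority class scores 90% accuracy, yet its F1-score and Matthews correlation coefficient expose it as useless on the minority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = np.array([1] * 90 + [0] * 10)
y_pred = np.ones(100, dtype=int)  # degenerate "always predict 1" classifier

print(accuracy_score(y_true, y_pred))              # → 0.9
print(f1_score(y_true, y_pred, pos_label=0))       # → 0.0 (minority class never found)
print(matthews_corrcoef(y_true, y_pred))           # → 0.0 (no better than chance)
```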

manlio
  • Actually the test data-set given to me has no class-labels, and I have to predict them, and check accuracy from an online judge, hence I think `sklearn.metrics` can't help me. What should I do then ? Is there a way to only predict whether the `class-label` is 0 or not for a given test-sample ? @manlio – Jarvis Sep 15 '16 at 14:42

Have you tried plotting a ROC curve and computing the AUC to check your parameters across different thresholds? If not, that should give you a good starting point.
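A minimal sketch of that suggestion with scikit-learn (synthetic data, not the asker's): fit a model, get predicted probabilities, compute the ROC curve and AUC, and pick a threshold other than the default 0.5:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic two-class data with a 9:1 imbalance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (90, 6)), rng.normal(1, 1, (10, 6))])
y = np.array([1] * 90 + [0] * 10)

clf = LogisticRegression(class_weight="balanced").fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # probability of class 1

auc = roc_auc_score(y, proba)
fpr, tpr, thresholds = roc_curve(y, proba)

# Choose the threshold maximizing TPR - FPR (Youden's J) instead of 0.5.
best_threshold = thresholds[np.argmax(tpr - fpr)]
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) gives the ROC curve itself.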

Aditya Patel