
I have an imbalanced dataset of text data scraped from web pages. I manually classified the relevant samples as the positive class, while the negative class can contain any other kind of text. Looking at the dataset, it became clear that negative samples are scarce: roughly 1200 out of 6000.

Negative = 1200

Positive = 4800

Initially, with the imbalanced Porter-stemmed dataset, the model was biased towards the majority class: it reported high accuracy but performed very poorly on unseen data.

So I took 1200 negative and 1200 positive samples to build a balanced dataset.

I implemented a dense model in Keras with 4 layers of 64 nodes each and regularization of 0.5, achieving about 60% accuracy in cross-validation while training accuracy climbs above 95%.
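
Roughly, the setup is the following (a sketch: the 0.5 regularization is shown as dropout, and `input_dim` is a placeholder for the input feature width):

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 5000  # placeholder width of the bag-of-words / TF-IDF vectors

model = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```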

Looking at val_acc versus acc, the model clearly starts overfitting after around 20 epochs. It also fails to generalize because the balanced dataset has so few rows.

  • What are the ways to tackle such problems?
  • Can a One-Class SVM help with single-category text classification?
  • If a One-Class SVM can help, can anyone provide a basic example or resource for implementing it?
Pishang Ujeniya

1 Answer


First of all, are you sure the labels are clean, e.g. that no positive examples ended up among the 1200 you marked as negative (or the other way round)? Rubbish in, rubbish out; make sure that is not the case here.

What are the ways to tackle such problems

In the order in which I would approach the problem:

  • Make sure your data representation is good. If you are working with text data you should use word vectors, e.g. pretrained word2vec, also available through TensorFlow and TensorFlow Hub (where you can also find more advanced embeddings such as ELMo); see the first sketch after this list.

  • Getting more examples - this should usually yield the best results (once the step above is done), but it takes time.

  • Trying a different algorithm - some algorithms do not care much about class imbalance, decision trees and their variants being the most prominent, I think. You should really check them out: start with a simple decision tree, then a random forest, then boosted trees such as xgboost, LightGBM or CatBoost. The last three should perform quite similarly; xgboost might be the best choice given the abundance of material on it.
  • Different metrics - accuracy is not the best one, as it is dominated by the majority class. Use other metrics like precision and recall, and focus on recall for the minority (negative) class (your model probably does not find enough of those examples); see the metrics sketch after this list.
  • Weighted loss - errors made on the minority (negative) examples are weighted higher than errors on the majority (positive) ones. I like this better than the next two options, as the model tries to accommodate the data rather than the data being altered. TensorFlow/Keras supports this through class weights or a custom loss; a class-weighting sketch follows this list.
  • Upsampling - the reverse of what you did: show the model the same minority (negative) examples multiple times (four copies of each here, so there are 4800 negative examples, as many as positives). You do not lose information, but training takes longer (a non-issue for a dataset this small); see the upsampling sketch after this list.
  • Undersampling - what you did, but you lose a lot of information about the majority (positive) class and its traits. It suits bigger datasets better; yours is small.
  • Creative approaches - this is harder with textual data; otherwise you could try dimensionality reduction or some other representation of the data that exposes the underlying difference between positive and negative points. It is the hardest route and probably would not help in your case.
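
As a rough sketch of the first point, assuming TensorFlow 2.x with the `tensorflow_hub` package installed (the module URL is one of the public pretrained English embeddings on TF Hub; swap in another if you prefer):

```python
# Pretrained sentence embeddings from TensorFlow Hub feeding a small classifier.
import tensorflow as tf
import tensorflow_hub as hub

embedding = hub.KerasLayer(
    "https://tfhub.dev/google/nnlm-en-dim50/2",
    input_shape=[], dtype=tf.string, trainable=False)

model = tf.keras.Sequential([
    embedding,                                     # raw strings -> 50-dim vectors
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```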
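
For the metrics point, a minimal sketch with scikit-learn (the label and prediction arrays are placeholders for your validation labels and thresholded model outputs):

```python
# Report precision/recall per class instead of plain accuracy.
from sklearn.metrics import classification_report, recall_score

y_val     = [0, 0, 1, 1, 1, 1, 0, 1]   # 0 = negative (minority), 1 = positive
val_preds = [0, 1, 1, 1, 1, 1, 0, 0]

print(classification_report(y_val, val_preds, target_names=["negative", "positive"]))
print("negative-class recall:", recall_score(y_val, val_preds, pos_label=0))
```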
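
For the weighted-loss point, a minimal sketch using Keras' built-in `class_weight` argument rather than a hand-written loss (the data here is a random stand-in; the weights simply invert the 1200/4800 class frequencies):

```python
# Class-weighted training in Keras. X_train / y_train are random stand-ins for
# the real feature matrix and labels (0 = negative, 1 = positive).
import numpy as np
from tensorflow import keras

X_train = np.random.rand(200, 20).astype("float32")
y_train = np.random.randint(0, 2, size=200)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Errors on the minority (negative) class count 4x as much, inverting the
# 1200 / 4800 class frequencies of the original dataset.
model.fit(X_train, y_train, epochs=5,
          class_weight={0: 4800 / 1200, 1: 1.0})
```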
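
For the upsampling point, a minimal sketch with `sklearn.utils.resample` on a toy frame (the column names and texts are placeholders for your data):

```python
# Upsample the minority (negative) class instead of discarding positives.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "text":  ["pos a", "pos b", "pos c", "pos d", "neg a"],
    "label": [1, 1, 1, 1, 0],   # 1 = positive, 0 = negative
})

majority = df[df.label == 1]
minority = df[df.label == 0]

# Sample the minority class with replacement until it matches the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
print(balanced.label.value_counts())
```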

Can One Class SVM help

I doubt it; it is used for outlier detection, and 1200 data points out of 6000 are too many to be treated as outliers. Furthermore, the negative class may share a lot of features with the positive class, and you would not be able to make use of the labelled data you currently have.

If you want to try it anyway, there is an implementation in sklearn (sklearn.svm.OneClassSVM); a minimal sketch is below.
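
A sketch of that route, fitted only on positive texts (TF-IDF features and nu=0.1 are assumptions, not tuned values; -1 at predict time means "does not look like the positive class"):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

positive_texts = ["relevant page one", "relevant page two", "relevant page three"]
new_texts      = ["some random page", "another relevant page"]

# Fit the vectorizer and the one-class SVM on positive examples only.
vectorizer = TfidfVectorizer()
X_pos = vectorizer.fit_transform(positive_texts)

clf = OneClassSVM(kernel="rbf", gamma="auto", nu=0.1)
clf.fit(X_pos)

print(clf.predict(vectorizer.transform(new_texts)))  # +1 inlier, -1 outlier
```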

Szymon Maszke
  • Thanks for pointing out undersampling. I kept my eye on the F1 score instead of accuracy once I found that my dataset was imbalanced and the model was leaning towards the majority class. – Pishang Ujeniya Mar 06 '19 at 10:46