I have an imbalanced dataset of text scraped from web pages. I manually labelled the samples that belong to the positive class; the negative class can contain any other type of text, which I marked as negative. Looking at the dataset, it is clear that negative samples are scarce: approximately 1200 out of 6000.
Negative = 1200
Positive = 4800
Initially, with the imbalanced, Porter-stemmed dataset, the model was biased towards the majority class: it reported high accuracy but performed very poorly on unseen data.
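To illustrate why the high accuracy was misleading, here is a small sketch of the arithmetic using the class counts above. It assumes a hypothetical baseline that always predicts the majority class; it is not my actual model.

```python
# Class counts from the dataset described above.
n_negative = 1200
n_positive = 4800
total = n_negative + n_positive

# Hypothetical baseline: always predict the majority (positive) class.
majority_accuracy = max(n_negative, n_positive) / total

# Such a baseline never predicts the minority class, so its recall there is zero.
minority_recall = 0 / min(n_negative, n_positive)

print(f"majority-class baseline accuracy: {majority_accuracy:.2f}")
print(f"minority-class recall: {minority_recall:.2f}")
```

So a model can score around 80% accuracy on this data without learning anything about the negative class.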
So I took 1200 negative and 1200 positive samples to make the dataset balanced.
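The balancing step amounts to undersampling the majority class. A minimal sketch is below, assuming the samples are drawn at random; the `undersample` helper and the toy `rows` are hypothetical stand-ins, not my real code or data.

```python
import random

def undersample(rows, label_of, seed=0):
    """Randomly keep the same number of rows per label (the minority count)."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    # Size of the smallest class decides how many rows each class keeps.
    n_keep = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_keep))
    rng.shuffle(balanced)
    return balanced

# Toy stand-in for the real data: (text, label) pairs with the counts above.
rows = [(f"text {i}", "positive") for i in range(4800)]
rows += [(f"text {i}", "negative") for i in range(1200)]

balanced = undersample(rows, label_of=lambda row: row[1])
print(len(balanced))
```

The drawback is that 3600 of the positive samples are thrown away, which is part of why the balanced set is so small.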
I implemented a dense model of 64 nodes in 4 layers with regularization of 0.5 using Keras. It achieves about 60% accuracy in cross-validation, while training accuracy goes above 95%.
Looking at the val_acc and acc curves, I feel that the model is clearly overfitting after around 20 epochs. In addition, it cannot generalize because the balanced dataset has so few rows.
- What are the ways to tackle such problems?
- Can a One-Class SVM help with single-category text classification?
- If a One-Class SVM can help, can anyone provide a basic example or a resource for its implementation?
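
To make the last question concrete, this is roughly the kind of setup I have in mind: a minimal sketch, assuming scikit-learn's `TfidfVectorizer` and `OneClassSVM`, trained only on the positive class. The example texts and the `nu` value are made up for illustration, and I have not verified that this works well on my data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical toy texts standing in for the manually labelled positive class.
positive_texts = [
    "machine learning model training on text data",
    "training a neural network for text classification",
    "text classification with machine learning models",
    "deep learning model for classifying text documents",
]

# Turn raw text into TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(positive_texts)

# Fit on positive samples only; nu is an assumed value that upper-bounds
# the fraction of training points treated as outliers.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
clf.fit(X_train)

# predict() returns +1 for texts that look like the training class, -1 otherwise.
new_texts = [
    "a machine learning model for text classification",
    "recipe for chocolate cake with butter and sugar",
]
predictions = clf.predict(vectorizer.transform(new_texts))
print(predictions)
```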