Balance problem for classification on Cleveland Dataset

Question

I’ve questioned the way the famous Cleveland heart disease dataset labels its objects here.

This dataset is very unbalanced (many objects of the “no disease” class). I noticed that many papers using this dataset combine all the other classes and reduce the task to binary classification (disease vs. no disease).

Are there other ways to deal with this class imbalance, rather than reducing the number of classes, to get good results from a classifier?

Catalina Chircu · Accepted Answer · 2019-08-03T11:19:58.603

Generally speaking, when handling an imbalanced dataset, one should consider an unsupervised learning approach.

You may use the multivariate normal distribution. In your case, if you have many elements in one class and very few in the other, a supervised learning method is not appropriate. Therefore, anomaly detection with the multivariate normal distribution, which is an unsupervised machine learning approach, may be the solution. The algorithm learns from the data and finds the values that define it (i.e. the parameters describing the bulk of the data, here the "no disease" cases). Once these values are output, one can search for the elements that do not fit them; these are the so-called "abnormal elements" or "anomalies". In your case, these are the "disease" individuals.
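As a rough illustration of the idea above, here is a minimal sketch using NumPy only. The data is synthetic (it is not the Cleveland dataset), the function names `fit_gaussian` and `log_density` are made up for this example, and the 1% threshold and the small ridge term are assumptions you would need to tune on your own data.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean and covariance from rows assumed to be 'normal'."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Small ridge term keeps the covariance invertible (assumed value).
    cov += 1e-6 * np.eye(X.shape[1])
    return mu, cov

def log_density(X, mu, cov):
    """Log of the multivariate normal density at each row of X."""
    d = X.shape[1]
    diff = X - mu
    solved = np.linalg.solve(cov, diff.T).T
    maha = np.einsum("ij,ij->i", diff, solved)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 3))   # synthetic stand-in for "no disease"
unusual = rng.normal(6.0, 1.0, size=(10, 3))   # synthetic stand-in for "disease"

mu, cov = fit_gaussian(normal)
# Flag anything less likely than the lowest 1% of the "normal" rows (assumed cut-off).
threshold = np.percentile(log_density(normal, mu, cov), 1)
flags = log_density(unusual, mu, cov) < threshold
print(flags)
```

On real data the two groups will overlap far more than in this toy example, so the threshold is best chosen on a labelled validation set, for instance by looking at precision and recall.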

A second solution would be to balance your dataset and keep using the initial supervised learning algorithm. You can do that with the following techniques. They are generally sound, but they depend a lot on the data you have (mind, I do not have access to your input data!), so you should test them and see which one best fits your purpose.

Collect more elements for the class with few elements.
Duplicate elements in the class with fewer elements until it has as many as the class with more elements. This is a problem when the two classes differ greatly in size and you use a neural network: the class with duplicated elements will not be very varied, and neural networks give good results only when trained on a large amount of highly varied data.
Use less data from the class with more elements, so that both classes have as many elements as the class with few elements. Here too there might be a problem when using a neural network, because training it on less data might not give good results. Also be careful to keep more input elements than features, otherwise it will not work.
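The duplication and reduction techniques listed above can be sketched as follows. This is only an illustration with NumPy on synthetic labels (90 "no disease" vs. 10 "disease"); the helper names `random_oversample` and `random_undersample` are hypothetical and not taken from any library.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate rows of smaller classes until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        # Draw the missing rows with replacement from the same class.
        extra = members[rng.integers(0, len(members), size=target - n)]
        parts.append(np.concatenate([members, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]

def random_undersample(X, y, rng):
    """Keep a random subset of each class, sized like the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=target, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # synthetic features
y = np.array([0] * 90 + [1] * 10)      # 90 "no disease", 10 "disease"

X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
print(np.bincount(y_over), np.bincount(y_under))
```

Either resampling step should be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.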

Oh, my experiment is fully based on supervised learning. Actually I’m using an MLP to classify the data - heresthebuzz, Jul 23 '19 at 21:52
I see, but that does not contradict what I wrote above. You may use the labels you associate with your elements in order to measure precision and recall. If you train a supervised model with much more data for one class than for the other, the model's results and predictions will be unreliable (although you might get good results for the class with more elements) - Catalina Chircu, Jul 24 '19 at 00:59
Not sure if I totally agree with this statement, even though it can definitely be a valid approach. This is a very classical issue in ML, with extensive discussion about it. You may want to look into this: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/ - MaximeKan, Jul 24 '19 at 01:05
Answer to MaximeKan: You are certainly right. But I suggested here one approach, among others. One should try several techniques and choose the one that gives the best results for one's own dataset. @joann2555 : If there are very few positive examples, you should choose an unsupervised algorithm after all. Otherwise, try to duplicate your positive examples in order to obtain a balanced dataset. I hope this helps. - Catalina Chircu, Jul 28 '19 at 08:44
I updated the answer. I hope it might be useful! If you have time, please provide feedback on your results; I wonder which solution you chose. - Catalina Chircu, Aug 03 '19 at 11:21
