I’m using the Cleveland Heart Disease dataset from UCI for classification, but I don’t understand the target attribute.

The dataset description says that the values go from 0 to 4 but the attribute description says:

0: < 50% coronary disease

1: > 50% coronary disease

I’d like to know how to interpret this: is this dataset meant to be a multiclass or a binary classification problem? And must I group values 1–4 into a single class (presence of disease)?

heresthebuzz

3 Answers


It basically means that the presence of different heart diseases is denoted by 1, 2, 3, 4, while absence is denoted by 0. Most of the experiments conducted on this dataset have treated it as binary classification: presence (1, 2, 3, 4) vs. absence (0). One reason for this is the class imbalance problem (class 0 has about 160 samples, and classes 1–4 together make up the other half), combined with the small number of samples (only around 300 in total). So it makes sense to treat this as a binary classification problem rather than a multi-class one, given those constraints.
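As a minimal sketch of the grouping step (the sample target values here are illustrative, not the actual UCI file contents):

```python
import numpy as np

# Hypothetical slice of the target column as distributed by UCI:
# 0 = no disease, 1-4 = presence of disease.
target = np.array([0, 2, 1, 0, 3, 4, 0, 1])

# Collapse classes 1-4 into a single "disease present" class.
binary_target = (target > 0).astype(int)

print(binary_target)  # [0 1 1 0 1 1 0 1]
```

The same one-liner works on a pandas Series if you load the data with `pandas.read_csv`.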

Gambit1614
  • is this dataset meant to be a multiclass or a binary classification problem?

    Without changes, the dataset is ready to be used for a multi-class classification problem.

  • And must I group values 1-4 to a single class (presence of disease)?

    Yes, you must, as long as you are interested in using the dataset for a binary classification problem.

sentence

If you are working on an imbalanced dataset, you should use a re-sampling technique to get better results. On imbalanced datasets, a classifier can simply "predict" the most common class without performing any real analysis of the features.

You should try SMOTE: it synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class, computing the k-nearest neighbors for this point, and generating synthetic samples between them.

I also used K-fold cross-validation along with SMOTE. Cross-validation helps ensure that the model learns genuine patterns from the data.

When measuring the performance of the model, the accuracy metric can mislead: it shows high accuracy even when there are many false positives. Use metrics such as F1-score and MCC instead.
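To make the mechanism concrete, here is a hand-rolled sketch of the SMOTE idea described above (this is not the imblearn implementation; the function name and parameters are illustrative): pick a minority point, find one of its k nearest minority neighbors, and interpolate a synthetic sample between the two.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new, k=2):
    """Generate n_new synthetic minority samples by interpolating
    between a randomly picked minority point and one of its
    k nearest minority-class neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        p = X_min[i]
        d = np.linalg.norm(X_min - p, axis=1)  # distances to all minority points
        d[i] = np.inf                          # exclude the point itself
        neighbors = np.argsort(d)[:k]          # indices of k nearest neighbors
        q = X_min[rng.choice(neighbors)]
        lam = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(p + lam * (q - p))    # point on the segment p -> q
    return np.array(synthetic)

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = smote_sketch(X_minority, n_new=4)
print(X_new.shape)  # (4, 2)
```

Note that resampling should be applied only to the training split, never to the test split, so that the evaluation reflects the real class distribution.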

References :

https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets

SUN
  • Very interesting idea of synthesizing these points with SMOTE. Actually I’m very new to this world of machine learning and all my experiments are done with scikit-learn. Do you know if there is any library that implements this SMOTE technique? – heresthebuzz Jul 24 '19 at 16:46
  • The Python library Imblearn: from imblearn.over_sampling import SMOTE. Basically, once you split the dataset into training and test sets, apply the SMOTE algorithm; it will resample the training set. Pass the resampled dataset to the classification algorithm. Please go through the link below; it describes how to work with an imbalanced dataset using SMOTE sampling: https://www.kaggle.com/qianchao/smote-with-imbalance-data – SUN Jul 24 '19 at 20:37
  • Please go through the thread below on imbalanced datasets: https://stackoverflow.com/questions/57142772/what-is-the-correct-procedure-to-split-the-data-sets-for-classification-problem/57172467#57172467 – SUN Jul 24 '19 at 20:37
  • Hey, your suggestion of using SMOTE gave me very good results. Please make it an answer so I can mark it as the accepted answer. – heresthebuzz Jul 25 '19 at 15:39
  • Yeah, the SMOTE algorithm gives better results with imbalanced data sets. I am working on a project where I am experimenting with various sampling algorithms. – SUN Jul 25 '19 at 16:05
  • You should also consider ADASYN, which is an improved version of SMOTE that adds a little variance to the data, making it less artificial. – heresthebuzz Jul 25 '19 at 16:06