I am working on the hepatitis dataset from the UCI repository. Its classes are imbalanced (class distribution: DIE: 32, LIVE: 123), and I am trying to use the ADASYN oversampling method to balance them.

The examples generate a synthetic dataset and pass it to ADASYN. Can someone explain what X and y should be when I start from my own dataset instead?

I am referring to the example at the link below.

https://561-36019880-gh.circle-artifacts.com/0/home/ubuntu/imbalanced-learn/doc/_build/html/generated/imblearn.over_sampling.ADASYN.html#imblearn.over_sampling.ADASYN

Please help me split the dataset into the X and y values that ADASYN requires.

1 Answer

Your question is not entirely clear, but this might help:

X is a 2D matrix whose rows are examples and whose columns are your features.

y is your response, for example a 1D vector of True (for class LIVE) and False (for class DIE).
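To make that layout concrete, here is a minimal sketch that assumes the data has been loaded into a pandas DataFrame. The column names (`Class`, `AGE`, `BILIRUBIN`) and the tiny inline table are made up for illustration; the real hepatitis file has more columns and its own label coding. The UCI file also marks missing values with `?`, and as far as I know ADASYN will not accept missing values, so those would need to be imputed or dropped first.

```python
import pandas as pd

# Hypothetical stand-in for the hepatitis table; the column names and
# values are illustrative only, not the real dataset.
df = pd.DataFrame({
    "Class": ["LIVE", "LIVE", "DIE", "LIVE", "DIE", "LIVE"],
    "AGE": [30, 50, 78, 31, 61, 34],
    "BILIRUBIN": [1.0, 0.9, 4.6, 0.7, 2.3, 1.2],
})

label_col = "Class"

# X: every column except the label (2D, one row per example).
X = df.drop(columns=[label_col])

# y: the label column on its own (1D, one entry per row of X).
y = df[label_col]

print(X.shape)           # (number of rows, number of feature columns)
print(y.value_counts())  # class distribution before resampling
```

X and y built this way can then be passed to the sampler in place of the generated dataset used in the documentation example.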

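from imblearn.over_sampling import ADASYN

# Apply ADASYN (adaptive synthetic) over-sampling to the minority class.
# Older imbalanced-learn releases name this method fit_sample instead.
ada = ADASYN()
X_resampled, y_resampled = ada.fit_resample(X, y)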

X_resampled and y_resampled now contain your original data plus the synthetic samples generated for the minority class. Looking at y_resampled, you should observe a roughly equal number of labels for each class (ADASYN does not guarantee an exact match).

For your reference:

https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/examples/over-sampling/plot_adasyn.py

sinapan
  • How do we split the dataset into X and y? Should X be the columns other than the class label column, and y be the class label column? – Minu Bharatheedasan Feb 01 '18 at 06:14
  • After the split, how do we create a new CSV with the balanced dataset? – Minu Bharatheedasan Feb 01 '18 at 06:15
  • If you are using `pandas` to manage your data, you can use [pandas.DataFrame.to_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html), or you can see [this](https://stackoverflow.com/questions/2084069/create-a-csv-file-with-values-from-a-python-list) on Stack Overflow for creating a CSV from Python lists. – sinapan Feb 01 '18 at 16:45
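Following up on the CSV question in the comments, here is a rough sketch of recombining the resampled features and labels into one table and writing it out with `DataFrame.to_csv`. The helper `to_balanced_frame`, the column names, and the small arrays are hypothetical stand-ins shaped like a sampler's output; they are not real ADASYN results.

```python
import io

import numpy as np
import pandas as pd


def to_balanced_frame(X_resampled, y_resampled, feature_names, label_name="Class"):
    """Recombine resampled features and labels into a single DataFrame."""
    out = pd.DataFrame(np.asarray(X_resampled), columns=list(feature_names))
    out[label_name] = np.asarray(y_resampled)
    return out


# Stand-in arrays with the same layout a sampler returns (illustrative only).
X_res = np.array([[30, 1.0], [78, 4.6], [61, 2.3], [50, 0.9]])
y_res = np.array(["LIVE", "DIE", "DIE", "LIVE"])

balanced = to_balanced_frame(X_res, y_res, ["AGE", "BILIRUBIN"])

# Written to an in-memory buffer here; pass a file path such as
# "balanced.csv" to to_csv instead to create a file on disk.
buffer = io.StringIO()
balanced.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so the CSV contains only the feature columns and the label column.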