I have a data set that I want to split into a train set and a test set, with the train set holding 2/3 of the data. Both sets should be representative of the whole. The "Class" column contains either 4 or 2, representing the two classes, and I want the test set to have the same ratio of 4s to 2s as the full data. To do this I wrote the following snippet:

trainTotal = 455
benTotal = 296
malTotal = 455-296
b = 0
m = 0
tr = 0
i = 0
j = 0

for index, row in data.iterrows():
    if row['Class'] == 2:
        if tr < trainTotal and b < benTotal:
            train.loc[i] = data.iloc[index]
            b = b+1
            tr = tr + 1
            i = i+1
        else:
            test.loc[j] = data.iloc[index]
            j = j+1
    if row['Class'] == 4:
        if tr < trainTotal and m < malTotal:
            train.loc[i] = data.iloc[index]
            tr = tr + 1            
            i = i + 1
            m = m+1
        else:
            test.loc[j] = data.iloc[index]
            j = j + 1

I am getting the correct total number of rows in my train dataframe, but the classes are not represented as I had hoped. The branch if tr < trainTotal and b < benTotal: is being entered too many times. Any idea what the issue may be?

1 Answer

As Michael Gardner said, train_test_split is the function you're looking for.

By default it splits randomly, but you can pass the stratify argument to tell it that the train and test sets should keep the same class ratio as your Class column.

It works like this:

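from sklearn.model_selection import train_test_split

# The label column; stratifying on it keeps the class ratio in both splits.
target = data['Class']

X_train, X_test, y_train, y_test = train_test_split(
    data,
    target,
    test_size=1/3,    # the question asks for 2/3 train, 1/3 test
    stratify=target,
)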
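
To check that stratification does what you want, here is a minimal, self-contained sketch. The DataFrame is synthetic: the "feature" column and the class counts (458 rows of class 2, 241 of class 4) are assumptions made up for illustration, not your real data. It passes a single DataFrame to train_test_split, which then returns just a train and a test DataFrame, and compares the class proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumed counts, for illustration only).
data = pd.DataFrame({
    "feature": range(699),
    "Class": [2] * 458 + [4] * 241,
})

# Passing one DataFrame returns two DataFrames: train and test.
train, test = train_test_split(
    data,
    test_size=1/3,            # train keeps about 2/3 of the rows
    stratify=data["Class"],   # preserve the class ratio in both splits
    random_state=0,           # fixed seed so the split is reproducible
)

# Class proportions in each split should be close to those of the full data.
full_ratio = data["Class"].value_counts(normalize=True)
train_ratio = train["Class"].value_counts(normalize=True)
test_ratio = test["Class"].value_counts(normalize=True)

print("full:", full_ratio.to_dict())
print("train:", train_ratio.to_dict())
print("test:", test_ratio.to_dict())
```

The three printed proportions should agree to within a fraction of a percent. If you drop the stratify argument, the split is still random but the ratios can drift, especially on small data sets.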