Trying to split my dataframe into a representative train and test set

Question

I have a data set that I want to split into a train set and a test set, with the train set holding 2/3 of the data. I want both sets to be representative of the whole set. My "Class" column contains either 4 or 2, representing two classes, and I want the test set to have the same ratio of 4s to 2s as the full data. To do this I wrote the following snippet:

trainTotal = 455
benTotal = 296
malTotal = 455-296
b = 0
m = 0
tr = 0
i = 0
j = 0

for index, row in data.iterrows():
    if row['Class'] == 2:
        if tr < trainTotal and b < benTotal:
            train.loc[i] = data.iloc[index]
            b = b+1
            tr = tr + 1
            i = i+1
        else:
            test.loc[j] = data.iloc[index]
            j = j+1
    if row['Class'] == 4:
        if tr < trainTotal and m < malTotal:
            train.loc[i] = data.iloc[index]
            tr = tr + 1            
            i = i + 1
            m = m+1
        else:
            test.loc[j] = data.iloc[index]
            j = j + 1

I am getting the correct total number of rows in my train dataframe, but the classes are not represented as I had hoped. The branch if tr < trainTotal and b < benTotal: is being entered too many times. Any idea what the issue may be?

Why not use sklearn train_test_split? https://stackoverflow.com/questions/29438265/stratified-train-test-split-in-scikit-learn — Michael Gardner, Oct 10 '19 at 04:50
I thought if I use train_test_split it will just split the data how I specify. I was worried that the data would not be representative of the class column the way I want it to be — new_programmer_22, Oct 10 '19 at 04:52
@newwebdev22 Read the link I provided. Plenty of options to stratify your data. — Michael Gardner, Oct 10 '19 at 04:55
Yes it only has column names. I checked the size before trying to add things to it — new_programmer_22, Oct 10 '19 at 05:00
I used `train_test_split` from `sklearn` as @MichaelGardner suggested, and to make sure it splits the classes evenly, I put it into a `for` loop over, say, `i in range(30)`, used `i` as the `random_state` parameter, printed the class ratio for each `i`, and then chose the `random_state` that gave the closest ratio to that of the complete data. — Aryerez, Oct 10 '19 at 05:28

score 1 · Accepted Answer · answered Oct 10 '19 at 08:03

Like Michael Gardner said, train_test_split is the function you're looking for.

By default it'll split randomly, but you can use stratify to tell it that you want the same ratio for your Class column in the train and test datasets.

It works like this:

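from sklearn.model_selection import train_test_split

# Separate the label column from the features.
target = data['Class']
features = data.drop(columns=['Class'])

X_train, X_test, y_train, y_test = train_test_split(
    features,
    target,
    test_size=1/3,    # the train set keeps 2/3 of the rows
    stratify=target   # same Class ratio in train and test
)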
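
To check that the split really is representative, you can compare the class proportions before and after. Below is a minimal sketch on a made-up dataframe; the `feature` column and the row counts (400 rows of class 2, 200 of class 4) are assumptions for illustration only, not the asker's data. It passes the whole dataframe as a single array, so `train_test_split` returns just a train and a test dataframe.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 400 rows of class 2 and 200 rows of class 4.
data = pd.DataFrame({
    "feature": range(600),
    "Class": [2] * 400 + [4] * 200,
})

# Passing one array gives back two pieces: train and test.
train, test = train_test_split(
    data,
    test_size=1/3,            # train keeps 2/3 of the rows
    stratify=data["Class"],   # preserve the Class ratio in both pieces
    random_state=0,           # fixed seed so the split is reproducible
)

# Share of each class in the full data, the train set and the test set.
full_ratio = data["Class"].value_counts(normalize=True)
train_ratio = train["Class"].value_counts(normalize=True)
test_ratio = test["Class"].value_counts(normalize=True)

print(pd.DataFrame({"full": full_ratio, "train": train_ratio, "test": test_ratio}))
```

The three columns should agree up to the rounding forced by whole rows, which also means there is no need to search over `random_state` values for a good ratio.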
