I have a data set that I want to split into a train set and a test set. The train set should hold 2/3 of the data. I want both sets to be representative of the whole set. My "Class" column contains either 4 or 2, representing two classes, and I want my test set to have the same ratio of 4s to 2s as the whole set. To do this, I wrote the following snippet:
trainTotal = 455
benTotal = 296
malTotal = 455-296
b = 0
m = 0
tr = 0
i = 0
j = 0
for index, row in data.iterrows():
    if row['Class'] == 2:
        if tr < trainTotal and b < benTotal:
            train.loc[i] = data.iloc[index]
            b = b+1
            tr = tr + 1
            i = i+1
        else:
            test.loc[j] = data.iloc[index]
            j = j+1
    if row['Class'] == 4:
        if tr < trainTotal and m < malTotal:
            train.loc[i] = data.iloc[index]
            tr = tr + 1
            i = i + 1
            m = m+1
        else:
            test.loc[j] = data.iloc[index]
            j = j + 1
I am getting the correct total number of rows in my train dataframe, but the classes are not represented as I had hoped. The branch "if tr < trainTotal and b < benTotal:" is being entered too many times. Any idea what the issue may be?
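
For reference, here is a minimal sketch of the split described above (2/3 of each class in train, the rest in test), written with pandas only. The function name stratified_split, its parameters, and the toy frame are made up for illustration and are not part of the code in the question. The sketch also shows one possible cause of the miscount, under the assumption that the index of data has gaps (for example after dropping rows): iterrows() yields index labels, while data.iloc[index] selects by position, so the row that gets copied may not be the row whose "Class" was just checked.

```python
import pandas as pd


def stratified_split(data, label_col="Class", train_frac=2 / 3, seed=0):
    """Hypothetical helper: keep roughly the same class ratio in both sets."""
    # Draw train_frac of the rows within each class (needs pandas >= 1.1).
    train = data.groupby(label_col).sample(frac=train_frac, random_state=seed)
    # Rows not drawn for train go to test; assumes index labels are unique.
    test = data.drop(train.index)
    return train, test


# Toy frame (assumption): the index has gaps, as it would after dropping rows.
data = pd.DataFrame(
    {"Class": [2] * 6 + [4] * 3, "value": range(9)},
    index=[0, 1, 2, 4, 5, 7, 8, 10, 11],
)

# Label 4 sits at position 3, so loc and iloc return different rows here.
by_label = data.loc[4, "value"]
by_position = data.iloc[4]["value"]

train, test = stratified_split(data)
print(train["Class"].value_counts().to_dict())
print(test["Class"].value_counts().to_dict())
```

If the index of the real data does have gaps, using row directly (or data.loc[index]) inside the loop instead of data.iloc[index] would copy the same row that was tested. This is only a guess, since the question does not show how data was built.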