
I've got a CSV that I want to split 80% into a training set, 10% into a dev-test set, and 10% into a test set. The dev-test set won't be used further.

I've got it set up like:

import sklearn
import csv
with open('Letter.csv') as f:
    reader = csv.reader(f)
    annotated_data = [r for r in reader]

and for splitting:

import random  
random.seed(1234)  
random.shuffle(annotated_data)

But all the splitting I've seen only splits into 2 sets, and I can't see where to specify the partition sizes, e.g. I want 80% training. Maybe I'm blind, but can anyone help me? I don't know how to use pandas.

Also, once I split it, how do I access the sets separately? For example, I can read each record as a whole and count the number of entries, but once I split it I want to count how many records are in each set. Sorry if this deserves its own post, but I don't want to spam.

Son Truong
Alonzo Robbe

2 Answers


No, it's not possible in scikit-learn to split into three sets directly. The typical approach is to split twice: first 80/20, and then split the 20 percent 50/50. You want to look at the train_test_split function.

Essentially, the code with data X and y could look like this:

import numpy as np
from sklearn.model_selection import train_test_split

# 50 samples with 2 features each, and 50 matching labels
X, y = np.arange(100).reshape((50, 2)), list(range(50))

# First split: 80% train, 20% held out
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=1234)
# Second split: divide the held-out 20% evenly into dev (10%) and test (10%)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1234)

Now you would want to work with (X_train, y_train), (X_dev, y_dev) and (X_test, y_test).
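As for counting how many records end up in each set: each piece returned by train_test_split is an ordinary array/list, so len() gives the record count. A minimal sketch with example data (50 dummy samples, not your CSV):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 dummy samples with 2 features each, and 50 labels
X, y = np.arange(100).reshape((50, 2)), list(range(50))

# 80/20 split, then split the 20% evenly into dev and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=1234)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1234)

# len() counts the records in each set
print(len(X_train), len(X_dev), len(X_test))  # 40 5 5
```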

Quickbeam2k1

You can use train_test_split twice:

  1. Split the data into a ratio 0.8 : 0.2
  2. Split the smaller set into a ratio 0.5 : 0.5
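Applied to your annotated_data list (sketched here with dummy rows standing in for the CSV contents), those two steps could look like this. Note that train_test_split accepts plain Python lists, and the returned pieces are lists, so len() gives the record count per set:

```python
from sklearn.model_selection import train_test_split

# Dummy stand-in for the rows read from Letter.csv
annotated_data = [[f'row{i}', i] for i in range(100)]

# Step 1: split into a ratio 0.8 : 0.2
train_set, rest = train_test_split(annotated_data, test_size=0.2, random_state=1234)
# Step 2: split the smaller set into a ratio 0.5 : 0.5
dev_set, test_set = train_test_split(rest, test_size=0.5, random_state=1234)

print(len(train_set), len(dev_set), len(test_set))  # 80 10 10
```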
dim