
I've got a CSV that I want to split 80% into a training set, 10% into a dev-test set, and 10% into a test set. The dev-test set won't be used further.

I've got it set up like:

import sklearn
import csv
with open('Letter.csv') as f:
    reader = csv.reader(f)
    annotated_data = [r for r in reader]

and for splitting:

import random  
random.seed(1234)  
random.shuffle(annotated_data)

But all the splitting I've seen only splits into 2 sets, and I can't see where to specify the partition sizes, e.g. I want 80% training. Maybe I'm blind, but can anyone help me? I don't know how to use pandas.

Also, once I split it, how do I access the sets separately? For example, I can read each record as a whole and count the number of entries, but once I split it I want to count how many records are in each set. Sorry if this deserves its own post, but I don't want to spam.

Son Truong
Alonzo Robbe

2 Answers


No, it's not possible in scikit-learn to split into three sets directly. The typical approach is to split twice: first 80/20, and then split the 20 percent 50/50. You want to look at the train_test_split function.

Essentially, the code with data X and y could look like this:

import numpy as np
from sklearn.model_selection import train_test_split

# 50 samples with 2 features each, and 50 matching labels
X, y = np.arange(100).reshape((50, 2)), list(range(50))

# First split: 80% train, 20% held out
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=1234)
# Second split: divide the held-out 20% evenly into dev (10%) and test (10%)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1234)

Now you would want to work with (X_train, y_train), (X_dev, y_dev) and (X_test, y_test).
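As for counting how many records end up in each set: each piece returned by train_test_split is an ordinary array/list, so len() gives the record count. A minimal sketch with example data (50 dummy samples, not your CSV):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 dummy samples with 2 features each, and 50 labels
X, y = np.arange(100).reshape((50, 2)), list(range(50))

# 80/20 split, then split the 20% evenly into dev and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=1234)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1234)

# len() counts the records in each set
print(len(X_train), len(X_dev), len(X_test))  # 40 5 5
```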

Quickbeam2k1

You can use train_test_split twice:

  1. Split the data into a ratio 0.8 : 0.2
  2. Split the smaller set into a ratio 0.5 : 0.5
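Applied to your annotated_data list (sketched here with dummy rows standing in for the CSV contents), those two steps could look like this. Note that train_test_split accepts plain Python lists, and the returned pieces are lists, so len() gives the record count per set:

```python
from sklearn.model_selection import train_test_split

# Dummy stand-in for the rows read from Letter.csv
annotated_data = [[f'row{i}', i] for i in range(100)]

# Step 1: split into a ratio 0.8 : 0.2
train_set, rest = train_test_split(annotated_data, test_size=0.2, random_state=1234)
# Step 2: split the smaller set into a ratio 0.5 : 0.5
dev_set, test_set = train_test_split(rest, test_size=0.5, random_state=1234)

print(len(train_set), len(dev_set), len(test_set))  # 80 10 10
```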
dim