What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to the `cvpartition` or `crossvalind` functions in Matlab.
If you want to split the data set once into two parts, you can use `numpy.random.shuffle`, or `numpy.random.permutation` if you need to keep track of the indices (remember to fix the random seed to make everything reproducible):
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]
or
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
There are many other ways to repeatedly partition the same data set for cross validation. Many of those are available in the `sklearn` library (k-fold, leave-n-out, ...). `sklearn` also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
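For example, a minimal sketch of one such stratified method, assuming a scikit-learn version that provides `sklearn.model_selection` (`StratifiedShuffleSplit` preserves the class proportions in both parts; the data here is hypothetical):
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 100 samples, two balanced classes
X = np.random.rand(100, 5)
y = np.r_[np.zeros(50), np.ones(50)]

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X, y))  # indices of a single stratified split
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]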

- Thanks for these solutions. But doesn't the last method, using randint, have a good chance of giving the same indices for both test and training sets? – ggauravr Nov 05 '13 at 22:21
- The second solution is a valid answer while the 1st and 3rd ones are not. For the 1st solution, shuffling the dataset is not always an option; there are many cases where you have to keep the order of data inputs. And the 3rd one could very well produce the same indices for test and training (as pointed out by @ggauravr). – pedram bashiri Sep 17 '19 at 21:02
- You should _not_ resample for your cross validation set. The entire idea is that the CV set has never been seen by your algo before. The training and test sets are used to fit the data, so of course you'll get good results if you include those in your CV set. I want to upvote this answer because the 2nd solution is what I needed, but this answer has problems. – RubberDuck Apr 23 '20 at 11:39
There is another option that just entails using scikit-learn. As scikit's wiki describes, you can just use the following instructions:
import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)
This way you can keep in sync the labels for the data you're trying to split into training and test.
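If your classes are imbalanced, `train_test_split` also accepts a `stratify` argument; a small sketch with hypothetical labels (not part of the original answer):
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with a 60/40 class ratio; stratify=y keeps that ratio in both splits
X = np.arange(20).reshape((10, 2))
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)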

- This is a very practical answer, due to realistic handling of both train set and labels. – chinnychinchin Apr 04 '18 at 01:40
Just a note. In case you want the train, test, AND validation sets, you can do this:
from sklearn.model_selection import train_test_split
X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)
These parameters will give 70% to training, and 15% each to the test and val sets. Hope this helps.

- Should probably add this to your code: `from sklearn.cross_validation import train_test_split` to make it clear what module you are using – Radix Jul 14 '16 at 20:01
- @liang No, it doesn't have to be random. You could just say the train, test, and validation set sizes will be a, b, and c percent of the size of the total dataset. Let's say `a=0.7`, `b=0.15`, `c=0.15`, and `d = dataset`, `N=len(dataset)`, then `x_train = dataset[0:int(a*N)]`, `x_test = dataset[int(a*N):int((a+b)*N)]`, and `x_val = dataset[int((a+b)*N):]`. – offwhitelotus Jan 21 '17 at 16:26
- @offwhitelotus Obviously it's possible and even easy in Python, but it still takes a few lines. If the train_test_split one-liner had an option to do it, it'd be even easier. – liang Jan 22 '17 at 15:28
- @liang As far as I know, the only simple way to do that in sklearn is with the KFold splitter, but this will give you many splits: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold – offwhitelotus Jan 22 '17 at 21:14
- @offwhitelotus But KFold doesn't seem to allow the user to specify the percentage. KFold does have a shuffle option; if only train_test_split had that. – liang Jan 23 '17 at 08:47
- @liang train_test_split always shuffles the data. As for the KFold percentage, you can work around that. For example, if you wanted a test set with 5% of the total data, you can specify that you want 1/0.05 folds, and each of the 20 folds will be 5% of the data. – offwhitelotus Jan 24 '17 at 17:30
- @offwhitelotus While I agree with the methodology, on a different note, I strongly believe that a "VALIDATION" set is something taken as a split from the "TRAIN" set, not from the "TEST" set. If we use the suggested snippet, the validation set will be taken from the test set, and the model will see part of the data from the test set and fine-tune on that validation set, which it is never supposed to see; hence biasing the model. – Amith Adiraju May 08 '20 at 19:22
- Deprecated: https://stackoverflow.com/a/34844352/4237080, use `from sklearn.model_selection import train_test_split` – brienna Jun 08 '20 at 23:59
As the `sklearn.cross_validation` module was deprecated, you can use:
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

You may also consider a stratified division into training and testing sets. A stratified division also generates training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.
import numpy as np

def get_train_test_inds(y, train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and
    testing sets are preserved (stratified sampling).
    '''
    y = np.array(y)
    train_inds = np.zeros(len(y), dtype=bool)
    test_inds = np.zeros(len(y), dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y == value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion * len(value_inds))
        train_inds[value_inds[:n]] = True
        test_inds[value_inds[n:]] = True
    return train_inds, test_inds

y = np.array([1, 1, 2, 2, 3, 3])
train_inds, test_inds = get_train_test_inds(y, train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])
This code outputs:
[1 2 3]
[1 2 3]

- Thank you! The naming is somewhat misleading: `value_inds` are truly indices, but the outputs are not indices, only masks. – greenoldman Sep 02 '17 at 12:53
After doing some reading and taking into account the (many..) different ways of splitting the data into train and test, I decided to time it!
I used 4 different methods (none of them use the sklearn library, which I'm sure would give the best results, given that it is well designed and tested code):
- shuffle the whole matrix arr and then split the data into train and test
- shuffle the indices and then assign them to x and y to split the data
- same as method 2, but in a more efficient way
- using a pandas dataframe to split
Method 3 won by far with the shortest time, followed by method 1, while methods 2 and 4 turned out to be really inefficient.
The code for the 4 different methods I timed:
import numpy as np

arr = np.random.rand(100, 3)
X = arr[:, :2]
Y = arr[:, 2]
spl = 0.7
N = len(arr)
sample = int(spl * N)

#%% Method 1: shuffle the whole matrix arr and then split
np.random.shuffle(arr)  # X and Y are views of arr, so they are shuffled along with it
x_train, x_test, y_train, y_test = X[:sample, :], X[sample:, :], Y[:sample, ], Y[sample:, ]

#%% Method 2: shuffle the indices and then apply them to X and Y
train_idx = np.random.choice(N, sample, replace=False)  # replace=False avoids duplicate indices
Xtrain = X[train_idx]
Ytrain = Y[train_idx]
test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx, :], X[test_idx, :], Y[train_idx, ], Y[test_idx, ]

#%% Method 4: using a pandas dataframe to split
import pandas as pd

df = pd.read_csv(file_path, header=None)  # some csv file (I used some file with 3 columns)
train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)
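The timing harness itself isn't shown in the answer; a minimal sketch of how such a measurement could be taken with timeit (using Method 3 as the timed statement) might look like this:
import timeit

setup = '''
import numpy as np
arr = np.random.rand(100, 3)
X = arr[:, :2]
Y = arr[:, 2]
sample = int(0.7 * len(arr))
'''
stmt = '''
idx = np.random.permutation(arr.shape[0])
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test = X[train_idx, :], X[test_idx, :]
y_train, y_test = Y[train_idx], Y[test_idx]
'''
# minimum over 3 repetitions of 1000 loops, as reported below
print(min(timeit.repeat(stmt, setup=setup, repeat=3, number=1000)))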
And for the times, the minimum time to execute out of 3 repetitions of 1000 loops is:
- Method 1: 0.35883826200006297 seconds
- Method 2: 1.7157016959999964 seconds
- Method 3: 1.7876616719995582 seconds
- Method 4: 0.07562861499991413 seconds
I hope that's helpful!

I wrote a function for my own project to do this (it doesn't use numpy, though):
def partition(seq, chunks):
    """Splits the sequence into equal sized chunks and returns them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result
If you want the chunks to be randomized, just shuffle the list before passing it in.
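A quick usage sketch with hypothetical data, shuffling first so the chunks are randomized:
import random

data = list(range(100))
random.shuffle(data)              # randomize the order before partitioning
train, test = partition(data, 2)  # two equal-sized chunks of 50 elements each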

Split into train, test and validation sets:
import numpy as np

x = np.expand_dims(np.arange(100), -1)
print(x)
indices = np.random.permutation(x.shape[0])
# 90% train, 5% test, 5% validation
training_idx, test_idx, val_idx = indices[:int(x.shape[0]*.9)], indices[int(x.shape[0]*.9):int(x.shape[0]*.95)], indices[int(x.shape[0]*.95):]
training, test, val = x[training_idx,:], x[test_idx,:], x[val_idx,:]
print(training, test, val)

Here is code to split the data into n=5 folds in a stratified manner, using sklearn.model_selection (sklearn.cross_validation is deprecated):
# X = data array
# y = class labels
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Thanks pberkes for your answer. I just modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing:
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# or, equivalently:
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)
You will likely not only need to split into train and test, but also need a validation set to make sure your model generalizes. Here I am assuming 70% training data, 20% validation and 10% holdout/test data.
Check out np.split:
If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in
ary[:2] ary[2:3] ary[3:]
t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))])

I'm aware that my solution is not the best, but it comes in handy when you want to split data in a simplistic way, especially when teaching data science to newbies!
def simple_split(descriptors, targets):
    testX_indices = [i for i in range(descriptors.shape[0]) if i % 4 == 0]
    validX_indices = [i for i in range(descriptors.shape[0]) if i % 4 == 1]
    trainX_indices = [i for i in range(descriptors.shape[0]) if i % 4 >= 2]

    TrainX = descriptors[trainX_indices, :]
    ValidX = descriptors[validX_indices, :]
    TestX = descriptors[testX_indices, :]

    TrainY = targets[trainX_indices]
    ValidY = targets[validX_indices]
    TestY = targets[testX_indices]

    return TrainX, ValidX, TestX, TrainY, ValidY, TestY
According to this code, data will be split into three parts - 1/4 for the test part, another 1/4 for the validation part, and 2/4 for the training set.
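A hypothetical call, assuming `descriptors` is an N x d NumPy array and `targets` is a length-N vector:
import numpy as np

descriptors = np.random.rand(100, 5)  # hypothetical feature matrix
targets = np.random.rand(100)         # hypothetical target vector
TrainX, ValidX, TestX, TrainY, ValidY, TestY = simple_split(descriptors, targets)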

Yet another pure numpy way to split the dataset. This solution is based on numpy.split, which has already been mentioned above, but I add it here for reference.
import numpy as np

# Dataset
dataset = np.load(...)  # Dataset of shape N x (d1 ... dM)

# Splitting and shuffling with indexes
idx = np.arange(len(dataset))           # Vector of dataset sample indices
np.random.shuffle(idx)                  # Shuffle so the split is random
id_train = int(len(idx) * 0.8)          # Train 80%
id_valid = int(len(idx) * (0.8 + 0.05)) # Valid 5%, Test 15%
train, valid, test = np.split(idx, (id_train, id_valid))

# Indexing dataset subsets
dataset_train = dataset[train]  # Train set
dataset_valid = dataset[valid]  # Valid set
dataset_test = dataset[test]    # Test set

Here is another way of splitting the dataset. You can create a mask to select random rows using the np.random.rand() function:
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]
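Since the question is about NumPy arrays, the same boolean-mask idea also works without pandas; a small sketch with hypothetical data (note that the split sizes vary slightly from run to run):
import numpy as np

x = np.random.rand(100, 5)          # hypothetical dataset
msk = np.random.rand(len(x)) < 0.8  # ~80% True on average
train, test = x[msk], x[~msk]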
