How to split data into 3 sets (train, validation and test)?

Question

I have a pandas dataframe and I wish to divide it to 3 separate sets. I know that using train_test_split from sklearn.cross_validation, one can divide the data in two sets (train and test). However, I couldn't find any solution about splitting the data into three sets. Preferably, I'd like to have the indices of the original data.

I know that a workaround would be to use train_test_split two times and somehow adjust the indices. But is there a more standard / built-in way to split the data into 3 sets instead of 2?

This doesn't answer your specific question, but I think the more standard approach for this would be splitting into two sets, train and test, and running cross-validation on the training set thus eliminating the need for a stand alone "development" set. — David, Jul 07 '16 at 16:40
This came up before, and as far as I know there is no built-in method for that yet. — ayhan, Jul 07 '16 at 16:43
I suggest Hastie et al.'s *The Elements of Statistical Learning* for a discussion on why to use three sets instead of two (https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf Model assessment and selection chapter) — ayhan, Jul 07 '16 at 17:16
@David In some models to prevent overfitting, there is a need for 3 sets instead of 2. Because in your design choices, you are somehow tuning parameters to improve performance on the test set. To prevent that, a development set is required. So, using cross validation will not be sufficient. — CentAu, Jul 07 '16 at 17:23
@CentAu I don't understand. You tune your model in cross-validation and then do a final test with your test set to ensure that your CV results line up. There are use cases where CV isn't a good call, like when there is inherit bias/leakage in data (ex: forecasting), but you wouldn't be looking for a general purpose function to split your dataset into 3 random chunks if that was the case. — David, Jul 07 '16 at 17:51
@David Ah I see. If you tune on CV and then evaluate on final test set it is fine then. I misinterpreted your comment. +1 — CentAu, Jul 07 '16 at 17:56
@ayhan, a corrected URL for that book is https://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf, chapter 7 (p. 219). — Camille Goudeseune, May 08 '17 at 19:50
Using `train_test_split` two times is two lines of code. If you absolutely want only one line of code, then just wrap them up in a function. That way you keep the benefits of using a well-tested function with community support etc. — Jean-François Corbett, Oct 22 '18 at 12:38
Also, there isn't anything magical about the number 3. You could in principle need a fourth set (and a fifth and...) at a later point if you need to do more layers of cross-validation and testing with more models that have already been tested with the existing "test" set. So having an implementation in `sklearn` that is specific to exactly 3 sets may not be the way to go. It could, however, make sense to have an implementation that can split into *n* sets. Could roll your own by recursively calling `train_test_split`. — Jean-François Corbett, Oct 22 '18 at 12:44

MaxU - stand with Ukraine · Accepted Answer · 2020-10-19T09:19:09.640

287

Numpy solution. We will shuffle the whole dataset first (df.sample(frac=1, random_state=42)) and then split our data set into the following parts:

60% - train set,
20% - validation set,
20% - test set

In [305]: train, validate, test = \
              np.split(df.sample(frac=1, random_state=42), 
                       [int(.6*len(df)), int(.8*len(df))])

In [306]: train
Out[306]:
          A         B         C         D         E
0  0.046919  0.792216  0.206294  0.440346  0.038960
2  0.301010  0.625697  0.604724  0.936968  0.870064
1  0.642237  0.690403  0.813658  0.525379  0.396053
9  0.488484  0.389640  0.599637  0.122919  0.106505
8  0.842717  0.793315  0.554084  0.100361  0.367465
7  0.185214  0.603661  0.217677  0.281780  0.938540

In [307]: validate
Out[307]:
          A         B         C         D         E
5  0.806176  0.008896  0.362878  0.058903  0.026328
6  0.145777  0.485765  0.589272  0.806329  0.703479

In [308]: test
Out[308]:
          A         B         C         D         E
4  0.521640  0.332210  0.370177  0.859169  0.401087
3  0.333348  0.964011  0.083498  0.670386  0.169619

[int(.6*len(df)), int(.8*len(df))] - is an indices_or_sections array for numpy.split().

Here is a small demo for np.split() usage - let's split 20-elements array into the following parts: 80%, 10%, 10%:

In [45]: a = np.arange(1, 21)

In [46]: a
Out[46]: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

In [47]: np.split(a, [int(.8 * len(a)), int(.9 * len(a))])
Out[47]:
[array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]),
 array([17, 18]),
 array([19, 20])]

edited Oct 19 '20 at 09:19

answered Jul 07 '16 at 16:56

MaxU - stand with Ukraine

205,989
36
386
419

@root what exactly is the frac=1 parameter doing? – SpiderWasp42 Mar 12 '17 at 02:52
2

@SpiderWasp42, `frac=1` instructs `sample()` function to return all (`100%` or fraction = `1.0`) rows – MaxU - stand with Ukraine Mar 12 '17 at 08:45
21

Thanks @MaxU. I'd like to mention 2 things to keep things simplified. First, use `np.random.seed(any_number)` before the split line to obtain same result with every run. Second, to make unequal ratio like `train:test:val::50:40:10` use `[int(.5*len(dfn)), int(.9*len(dfn))]`. Here first element denotes size for `train` (0.5%), second element denotes size for `val` (1-0.9 = 0.1%) and difference between the two denotes size for `test`(0.9-0.5 = 0.4%). Correct me if I'm wrong :) – sync11 May 07 '18 at 17:57
hrmm is it a mistake when you say "Here is a small demo for np.split() usage - let's split 20-elements array into the following parts: 90%, 10%, 10%:" I am pretty sure you mean 80%, 10%, 10% – Kevin Jan 11 '19 at 21:19
Hey, @MaxU I had a case, something somewhat similar. I was wondering if you could look at it for me to see if it is and help me there. Here is my question https://stackoverflow.com/questions/54847668/how-to-split-validate-and-test-data-without-using-sci-kit-learn/54848029#54848029 – Deepak M Feb 24 '19 at 02:17
@DeepakM, you can easily use provided solution two times - for X and y – MaxU - stand with Ukraine Feb 24 '19 at 07:58
I am trying to use this method for splitting a 4D data (e.g, a 3D image with multiple channels). But pandas is complaining that I can only pass a 2D input. Is there a way to use this method if the input data is in the form of a multidimensional array? – tobiuchiha Jan 25 '20 at 00:30
Does this np.split shuffle the data ? – Kathiravan Natarajan May 06 '20 at 04:38
Good solution and much needed for all practitioners. – Kathiravan Natarajan May 06 '20 at 04:44
@KathiravanNatarajan, yes. In the first example it will be shuffled first, using - df.sample(frac=1) – MaxU - stand with Ukraine May 06 '20 at 06:39
I would add the random seed to the Pandas sample method explicitly. The default is None, not the Numpy seed. So pass `df.sample(frac=1., random_state=42)` or some variable or class instead of 42. – grofte Oct 19 '20 at 08:56
@grofte, it's a good point, thank you! I've updated the answer correspondingly... – MaxU - stand with Ukraine Oct 19 '20 at 09:20
Note that by using Numpy you're missing out on sklearn functionality such as split stratification. – Steven Jan 04 '21 at 13:13
@sync11 I'm afraid you are wrong ! according to docs : https://numpy.org/doc/stable/reference/generated/numpy.split.html, and cosidering your comment splitted data should be : train = data[0:50%] val = data[50%:90%] and test = data[90%:] so train is 50%, val is 40% and test is 10% – Saeid Mohammadi Nejati Apr 27 '23 at 11:49

score 74 · Answer 2 · edited May 12 '19 at 00:10

74

However, one approach to dividing the dataset into train, test, cv with 0.6, 0.2, 0.2 would be to use the train_test_split method twice.

from sklearn.model_selection import train_test_split

x, x_test, y, y_test = train_test_split(xtrain,labels,test_size=0.2,train_size=0.8)
x_train, x_cv, y_train, y_cv = train_test_split(x,y,test_size = 0.25,train_size =0.75)

edited May 12 '19 at 00:10

DINA TAKLIT

7,074
10
69
74

answered Mar 21 '17 at 16:10

blitu12345

3,473
1
20
21

Suboptimal for large datasets – Maksym Ganenko Aug 30 '19 at 12:28
1

@MaksymGanenko Can you please elaborate ? – blitu12345 Sep 01 '19 at 06:43
You suggest to split data with two separate operations. Each data split involves data copying. So when you suggest to use two separate split operations instead of one you artificially create burden on both RAM and CPU. So your solution is suboptimal. Data split should be done with a single operation like `np.split()`. Also, it doesn't require additional dependency on `sklearn`. – Maksym Ganenko Sep 02 '19 at 09:28
@MaksymGanenko agreed on the extra burden on the memory, and for the same we can delete the original data from the memory i.e(xtrain & labels)! And about your suggestion for using numpy is somewhat limited to only integer data types what about other data types? – blitu12345 Sep 02 '19 at 09:44
1

With `np.split()` you can split indices and so you may reindex any datatype. If you look into `train_test_split()` you'll see that it does exactly the same way: define `np.arange()`, shuffle it and then reindex original data. But `train_test_split()` can't split data into three datasets, so its use is limited. In the context of the answer it's suboptimal (== wrong). – Maksym Ganenko Sep 02 '19 at 09:56
11

another benefit of this approach is that you can use the stratification parameters. – Ami Tavory Mar 24 '20 at 09:57
4

@MaksymGanenko: Who cares if its performance is suboptimal? Typically, you'd want to perform train/val/test splitting only once, so therefore you want to make sure it's done *correctly*, not just efficiently. For train/val/test splitting, you need to have stratified sampling, which is not available with Numpy `split()`; you have to implement stratification yourself. The sci-kit learn function does all that for you using `train_test_split()`. – stackoverflowuser2010 Jan 04 '21 at 23:42
@stackoverflowuser2010 Well, let's talk about it when you've got a dataset that barely fit into the memory ) ..or better doesn't fit into the RAM at all which is quite common for production tasks. – Maksym Ganenko Jan 05 '21 at 11:13
2

@MaksymGanenko: If data can't fit into memory, then I'd use Spark and its libraries to split the data. – stackoverflowuser2010 Jan 05 '21 at 18:03
If you want to make a small optimization, make an empty dataframe with the same indices as your original data. Pass that through this double split, then in the end take slices from the original dataframe using the indexes of the splits. – creanion Sep 27 '21 at 19:22

score 72 · Answer 3 · edited Oct 17 '19 at 23:53

72

Note:

Function was written to handle seeding of randomized set creation. You should not rely on set splitting that doesn't randomize the sets.

import numpy as np
import pandas as pd

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]
    return train, validate, test

Demonstration

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df

train, validate, test = train_validate_test_split(df)

train

validate

test

edited Oct 17 '19 at 23:53

Simon

5,464
6
49
85

answered Jul 07 '16 at 16:47

piRSquared

285,575
57
475
624

2

I believe this function requires a df with index values ranging from 1 to n. In my case, I modified the function to use df.loc as my index values were not necessarily in this range. – iOSBeginner Apr 16 '20 at 12:38
I'd add one more `np.random.seed()` (without the passed seed) to the end, to avoid seeding the rest of your code. – wrygiel Dec 06 '22 at 07:20

stackoverflowuser2010 · Answer 4 · 2020-05-18T18:08:37.620

Here is a Python function that splits a Pandas dataframe into train, validation, and test dataframes with stratified sampling. It performs this split by calling scikit-learn's function train_test_split() twice.

import pandas as pd
from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None):
    '''
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    df_input : Pandas dataframe
        Input dataframe to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    '''

    if frac_train + frac_val + frac_test != 1.0:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' % \
                         (frac_train, frac_val, frac_test))

    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))

    X = df_input # Contains all columns.
    y = df_input[[stratify_colname]] # Dataframe of just the column on which to stratify.

    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test

Below is a complete working example.

Consider a dataset that has a label upon which you want to perform the stratification. This label has its own distribution in the original dataset, say 75% foo, 15% bar and 10% baz. Now let's split the dataset into train, validation, and test into subsets using a 60/20/20 ratio, where each split retains the same distribution of the labels. See the illustration below:

Here is the example dataset:

df = pd.DataFrame( { 'A': list(range(0, 100)),
                     'B': list(range(100, 0, -1)),
                     'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10 } )

df.head()
#    A    B label
# 0  0  100   foo
# 1  1   99   foo
# 2  2   98   foo
# 3  3   97   foo
# 4  4   96   foo

df.shape
# (100, 3)

df.label.value_counts()
# foo    75
# bar    15
# baz    10
# Name: label, dtype: int64

Now, let's call the split_stratified_into_train_val_test() function from above to get train, validation, and test dataframes following a 60/20/20 ratio.

df_train, df_val, df_test = \
    split_stratified_into_train_val_test(df, stratify_colname='label', frac_train=0.60, frac_val=0.20, frac_test=0.20)

The three dataframes df_train, df_val, and df_test contain all the original rows but their sizes will follow the above ratio.

df_train.shape
#(60, 3)

df_val.shape
#(20, 3)

df_test.shape
#(20, 3)

Further, each of the three splits will have the same distribution of the label, namely 75% foo, 15% bar and 10% baz.

df_train.label.value_counts()
# foo    45
# bar     9
# baz     6
# Name: label, dtype: int64

df_val.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64

df_test.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64

NameError: name 'df' is not defined. The 'df' in split_stratified_into_train_val_test() should be replaced with 'df_input'. — mdrafiqulrabin, May 18 '20 at 11:24
Thanks. I fixed it. The problem was in an error-handling path of the code. — stackoverflowuser2010, May 18 '20 at 18:09

Ken · Answer 5 · 2020-09-04T10:05:40.977

In the case of supervised learning, you may want to split both X and y (where X is your input and y the ground truth output). You just have to pay attention to shuffle X and y the same way before splitting.

Here, either X and y are in the same dataframe, so we shuffle them, separate them and apply the split for each (just like in chosen answer), or X and y are in two different dataframes, so we shuffle X, reorder y the same way as the shuffled X and apply the split to each.

# 1st case: df contains X and y (where y is the "target" column of df)
df_shuffled = df.sample(frac=1)
X_shuffled = df_shuffled.drop("target", axis = 1)
y_shuffled = df_shuffled["target"]

# 2nd case: X and y are two separated dataframes
X_shuffled = X.sample(frac=1)
y_shuffled = y[X_shuffled.index]

# We do the split as in the chosen answer
X_train, X_validation, X_test = np.split(X_shuffled, [int(0.6*len(X)),int(0.8*len(X))])
y_train, y_validation, y_test = np.split(y_shuffled, [int(0.6*len(X)),int(0.8*len(X))])

A.Ametov · Answer 6 · 2018-11-16T20:06:21.247

It is very convenient to use train_test_split without performing reindexing after dividing to several sets and not writing some additional code. Best answer above does not mention that by separating two times using train_test_split not changing partition sizes won`t give initially intended partition:

x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))

Then the portion of validation and test sets in the x_remain change and could be counted as

new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0 
new_val_size = 1.0 - new_test_size

x_val, x_test = train_test_split(x_remain, test_size=new_test_size)

In this occasion all initial partitions are saved.

score 1 · Answer 7 · edited Nov 21 '20 at 17:17

def train_val_test_split(X, y, train_size, val_size, test_size):
    X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size = test_size)
    relative_train_size = train_size / (val_size + train_size)
    X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
                                                      train_size = relative_train_size, test_size = 1-relative_train_size)
    return X_train, X_val, X_test, y_train, y_val, y_test

Here we split data 2 times with sklearn's train_test_split

score 1 · Answer 8 · answered Nov 30 '20 at 22:20

Considering that df id your original dataframe:

1 - First you split data between Train and Test (10%):

my_test_size = 0.10

X_train_, X_test, y_train_, y_test = train_test_split(
    df.index.values,
    df.label.values,
    test_size=my_test_size,
    random_state=42,
    stratify=df.label.values,    
)

2 - Then you split the train set between train and validation (20%):

my_val_size = 0.20

X_train, X_val, y_train, y_val = train_test_split(
    df.loc[X_train_].index.values,
    df.loc[X_train_].label.values,
    test_size=my_val_size,
    random_state=42,
    stratify=df.loc[X_train_].label.values,  
)

3 - Then, you slice the original dataframe according to the indices generated in the steps above:

# data_type is not necessary. 
df['data_type'] = ['not_set']*df.shape[0]
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
df.loc[X_test, 'data_type'] = 'test'

The result is going to be like this:

Note: This soluctions uses the workaround mentioned in the question.

score 0 · Answer 9 · answered May 23 '21 at 18:28

Split the dataset in training and testing set as in the other answers, using

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Then, if you fit your model, you can add validation_split as a parameter. Then you do not need to create the validation set in advance. For example:

from tensorflow.keras import Model

model = Model(input_layer, out)

[...]

history = model.fit(x=X_train, y=y_train, [...], validation_split = 0.3)

The validation set is meant to serve as a representative on-the-run-testing-set during training of the training set, taken entirely from the training set, be it by k-fold cross-validation (recommended) or by validation_split; then you do not need to create a validation set separately and still you split a dataset into the three sets you are asking for.

score 0 · Answer 10 · answered Oct 29 '21 at 15:32

ANSWER FOR ANY AMOUNT OF SUB-SETS:

def _separate_dataset(patches, label_patches, percentage, shuffle: bool = True):
    """
    :param patches: data patches
    :param label_patches: label patches
    :param percentage: list of percentages for each value, example [0.9, 0.02, 0.08] to get 90% train, 2% val and 8% test.
    :param shuffle: Shuffle dataset before split.
    :return: tuple of two lists of size = len(percentage), one with data x and other with labels y.
    """
    x_test = patches
    y_test = label_patches
    percentage = list(percentage)       # need it to be mutable
    assert sum(percentage) == 1., f"percentage must add to 1, but it adds to sum{percentage} = {sum(percentage)}"
    x = []
    y = []
    for i, per in enumerate(percentage[:-1]):
        x_train, x_test, y_train, y_test = train_test_split(x_test, y_test, test_size=1-per, shuffle=shuffle)
        percentage[i+1:] = [value / (1-percentage[i]) for value in percentage[i+1:]]
        x.append(x_train)
        y.append(y_train)
    x.append(x_test)
    y.append(y_test)
    return x, y

This work for any size of percentage. In your case, you should do percentage = [train_percentage, val_percentage, test_percentage].

score 0 · Answer 11 · answered Mar 22 '22 at 15:33

The easiest way that I could think of is mapping split fractions to array index as follows:

train_set = data[:int((len(data)+1)*train_fraction)]
test_set = data[int((len(data)+1)*train_fraction):int((len(data)+1)*(train_fraction+test_fraction))]
val_set = data[int((len(data)+1)*(train_fraction+test_fraction)):]

where data = random.shuffle(data)

Lars · Answer 12 · 2023-04-18T11:41:01.383

I am always using this method to do a train, test, validation split. It always splits your data in the desired sizes.

def train_test_val_split(df, train_size, val_size, test_size, random_state=42):
    """
    Splits a pandas dataframe into training, validation, and test sets.

    Args:
    - df: pandas dataframe to split.
    - train_size: float between 0 and 1 indicating the proportion of the dataframe to include in the training set.
    - val_size: float between 0 and 1 indicating the proportion of the dataframe to include in the validation set.
    - test_size: float between 0 and 1 indicating the proportion of the dataframe to include in the test set.
    - random_state: int or None, optional (default=42). The seed used by the random number generator.

    Returns:
    - train_df: pandas dataframe containing the training set.
    - val_df: pandas dataframe containing the validation set.
    - test_df: pandas dataframe containing the test set.

    Raises:
    - AssertionError: if the sum of train_size, val_size, and test_size is not equal to 1.
    """

    assert train_size + val_size + test_size == 1, "Train, validation, and test sizes must add up to 1."
    
    # Split the dataframe into training and test sets
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=random_state)
    
    # Calculate the size of the validation set relative to the original dataframe
    val_ratio = val_size / (1 - test_size)
    
    # Split the training set into training and validation sets
    train_df, val_df = train_test_split(train_df, test_size=val_ratio, random_state=random_state)
    
    return train_df, val_df, test_df

EDIT: You can also include the split of X and y into train, test, val set directly:

def train_test_val_split(X, y, train_size, val_size, test_size, random_state=42):
    """
    Splits X and y into training, validation, and test sets.

    Args:
    - X: pandas dataframe or array containing the independent variables.
    - y: pandas series or array containing the dependent variable.
    - train_size: float between 0 and 1 indicating the proportion of the data to include in the training set.
    - val_size: float between 0 and 1 indicating the proportion of the data to include in the validation set.
    - test_size: float between 0 and 1 indicating the proportion of the data to include in the test set.
    - random_state: int or None, optional (default=42). The seed used by the random number generator.

    Returns:
    - X_train: pandas dataframe or array containing the independent variables for the training set.
    - X_val: pandas dataframe or array containing the independent variables for the validation set.
    - X_test: pandas dataframe or array containing the independent variables for the test set.
    - y_train: pandas series or array containing the dependent variable for the training set.
    - y_val: pandas series or array containing the dependent variable for the validation set.
    - y_test: pandas series or array containing the dependent variable for the test set.

    Raises:
    - AssertionError: if the sum of train_size, val_size, and test_size is not equal to 1.
    """

    assert train_size + val_size + test_size == 1, "Train, validation, and test sizes must add up to 1."
    
    # Concatenate X and y into a single dataframe
    df = pd.concat([X, y], axis=1)
    
    # Split the dataframe into training and test sets
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=random_state)
    
    # Calculate the size of the validation set relative to the original dataframe
    val_ratio = val_size / (1 - test_size)
    
    # Split the training set into training and validation sets
    train_df, val_df = train_test_split(train_df, test_size=val_ratio, random_state=random_state)
    
    # Split the training, validation, and test dataframes into X and y values
    X_train, y_train = train_df.drop(columns=y.name), train_df[y.name]
    X_val, y_val = val_df.drop(columns=y.name), val_df[y.name]
    X_test, y_test = test_df.drop(columns=y.name), test_df[y.name]
    
    return X_train, X_val, X_test, y_train, y_val, y_test

santon · Answer 13 · 2023-05-16T23:09:05.043

Here's a version that recursively splits the input arrays using sklearn's train_test_split function. Therefore, it handles an arbitrary number of input arrays (just like the sklearn version) in addition to an arbitrary number of splits. I've tested it on 3 input arrays and four splits.

from typing import Any, Iterable, List, Union

import numpy as np
from sklearn.model_selection import train_test_split as _train_test_split


def train_test_split(
    *arrays: Any,
    sizes: Union[float, Iterable[float]] = None,
    random_state: Any = None,
    shuffle: bool = True,
    stratify: Any = None
) -> List:
    """Like ``sklearn.model_selection.train_test_split`` but handles multiple splits.

    Returns:
        A list of array splits. The first ``n`` elements are the splits of the first array and so
        on. For example, ``X_train``, ``X_valid``, ``X_test``.

    Examples:
        >>> train_test_split(list(range(10)), sizes=(0.7, 0.2, 0.1))
        [[0, 1, 9, 6, 5, 8, 2], [3, 7], [4]]
        >>> train_test_split(features, labels, sizes=(0.7, 0.2, 0.1))
    """
    if isinstance(sizes, float):
        sizes = [sizes]
    else:
        sizes = np.array(sizes, dtype="float")
        if len(sizes) > 1:
            sizes /= sizes.sum()

    train_size = sizes[0]
    common_args = dict(random_state=random_state, shuffle=shuffle, stratify=stratify)

    if len(sizes) <= 2:
        return _train_test_split(*arrays, train_size=train_size, **common_args)
    else:
        n = len(arrays)
        left_arrays = _train_test_split(*arrays, train_size=train_size, **common_args)
        right_arrays = train_test_split(*left_arrays[1::2], sizes=sizes[1:], **common_args)

        interleaved = []
        s = len(sizes) - 1
        for i in range(n):
            interleaved.append(left_arrays[i * 2])
            interleaved.extend(right_arrays[i * s : (i + 1) * s])

        return interleaved

How to split data into 3 sets (train, validation and test)?

13 Answers13

Note:

Demonstration

Linked

Related