Stratified splitting of pandas dataframe into training, validation and test set

Question

The following extremely simplified DataFrame represents a much larger DataFrame containing medical diagnoses:

import pandas as pd

medicalData = pd.DataFrame({'diagnosis': ['positive', 'positive', 'negative', 'negative', 'positive',
                                          'negative', 'negative', 'negative', 'negative', 'negative']})
medicalData

    diagnosis
0   positive
1   positive
2   negative
3   negative
4   positive
5   negative
6   negative
7   negative
8   negative
9   negative

Problem: For machine learning, I need to randomly split this dataframe into three subframes in the following way:

trainingDF, validationDF, testDF = SplitData(medicalData,fractions = [0.6,0.2,0.2])

...where the split array specifies the fraction of the complete data that goes into each subframe.

The subframes need to be mutually exclusive, and the entries of the split array (fractions) need to sum to one.
Additionally, the fraction of positive diagnoses in each subset needs to be approximately the same.
Answers to this question recommend using the pandas sample method or the train_test_split function from sklearn. But none of these solutions seem to generalize well to n splits and none provides a stratified split.

You know you could just split the test into two parts again. — cs95, Jun 10 '18 at 07:45
Thanks, but I have explicitly mentioned in my question that these solutions don't cover my second requirement, that each subset needs to contain approximately the same fraction of positive samples. – Oblomov, Jun 10 '18 at 07:48

score 23 · Accepted Answer · edited Jun 10 '18 at 08:24

`np.array_split`

If you want to generalise to n splits, np.array_split is your friend (it works with DataFrames well).

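import numpy as np

fractions = np.array([0.6, 0.2, 0.2])
# shuffle your input (df is your DataFrame, e.g. medicalData from the question)
df = df.sample(frac=1)
# split into 3 parts; the cut positions are the cumulative fractions
# (without the last one) scaled to the number of rows
train, val, test = np.array_split(
    df, (fractions[:-1].cumsum() * len(df)).astype(int))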
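
To see what the second argument does, here is a small sketch of the cut-point arithmetic for a ten-row frame like the one in the question. The helper name `cut_points` is made up for illustration, and the split is applied to an array of row positions rather than to the DataFrame itself. Note that this kind of split is random but not stratified.

```python
import numpy as np

def cut_points(fractions, n_rows):
    """Positions at which to cut n_rows rows, mirroring the expression above."""
    fractions = np.asarray(fractions, dtype=float)
    # n fractions need only n - 1 cuts, so the last fraction is dropped.
    return (fractions[:-1].cumsum() * n_rows).astype(int)

cuts = cut_points([0.6, 0.2, 0.2], 10)

# Shuffle the row positions, then split them at the cut points.
rng = np.random.default_rng(0)
positions = rng.permutation(10)
train_pos, val_pos, test_pos = np.split(positions, cuts)
print(cuts, len(train_pos), len(val_pos), len(test_pos))
```

Because `astype(int)` truncates, a product that lands just below a whole number through floating-point rounding would come out one row short; rounding before the cast is one way to guard against that.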

`train_test_split`

A more long-winded solution using train_test_split for stratified splitting.

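from sklearn.model_selection import train_test_split

# separate the label column from the features
y = df.pop('diagnosis').to_frame()
X = df

# first split: 60% train, 40% held out
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.4)

# second split: halve the held-out 40% into test and validation (20% each)
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, stratify=y_test, test_size=0.5)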

Where X is a DataFrame of your features, and y is a single-columned DataFrame of your labels.
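
The two snippets above each cover one requirement: the first handles n parts, the second stratifies. One rough way to combine them is to apply the cumulative-fraction cut separately within each class. The sketch below is not part of the original answer; the function name `stratified_split` and its parameters are hypothetical, and it assumes the DataFrame has a unique index and no missing labels.

```python
import numpy as np
import pandas as pd

def stratified_split(df, label_col, fractions, random_state=None):
    """Split df into len(fractions) disjoint frames, cutting each label
    group separately so that class ratios are roughly preserved."""
    fractions = np.asarray(fractions, dtype=float)
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError('fractions must sum to 1')
    rng = np.random.default_rng(random_state)
    parts = [[] for _ in fractions]
    for _, group in df.groupby(label_col):
        # Shuffle the index labels of this class, then cut them at the
        # cumulative fractions (rounded to whole rows).
        idx = rng.permutation(group.index.to_numpy())
        cuts = np.round(fractions[:-1].cumsum() * len(idx)).astype(int)
        for part, chunk in zip(parts, np.split(idx, cuts)):
            part.extend(chunk.tolist())
    return [df.loc[part] for part in parts]

# Usage with a 20/80 class balance, as mentioned in the comments.
demo = pd.DataFrame({'diagnosis': ['positive'] * 20 + ['negative'] * 80})
train, val, test = stratified_split(demo, 'diagnosis', [0.6, 0.2, 0.2], random_state=0)
print([len(part) for part in (train, val, test)])
```

With very small classes the rounding can leave a part without any row of that class, so the fractions are only matched approximately.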

edited Jun 10 '18 at 08:24 · Oblomov

answered Jun 10 '18 at 07:48 · cs95

Thanks, that covers my first requirement. But what about my second requirement that each subset needs to approximately contain the same fraction of positive samples? – Oblomov Jun 10 '18 at 07:50
@user1934212 assuming you have an equal number of samples and enough data, it should be fine (thanks to randomness). But if you're particular on that stuff, I don't think you can work with this. Maybe look into StratifiedKFold splitting with sklearn. – cs95 Jun 10 '18 at 07:52
The nature of medical data is that there are usually far fewer positive diagnoses than negative ones. At least in my case, #positive/#negative == 20/80 – Oblomov Jun 10 '18 at 07:54
Can I just pass the dataframe into train_test_split? Or what are the parameters X and y based on my code sample? – Oblomov Jun 10 '18 at 07:59
@user1934212 y is your column of labels, and X is every column that does not include y. – cs95 Jun 10 '18 at 08:05
@user1934212 One last thing, please ensure y has a shape of `[xxx, 1]`. – cs95 Jun 10 '18 at 08:06
I don't understand your last comment. Could you improve your answer so that it uses the DataFrame object from my question? – Oblomov Jun 10 '18 at 08:08
Many more, but I left the others out in my example for the sake of simplicity. – Oblomov Jun 10 '18 at 08:10
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/172833/discussion-between-user1934212-and-coldspeed). – Oblomov Jun 10 '18 at 08:31
@user1934212 I'm not going to be active, so if you have any followups I'd recommend opening a new question :) – cs95 Jun 10 '18 at 08:36
@cs95: You use `df.pop()` to remove the dataframe column that has the y label. What if I want to keep the y label with the dataframe? Would you remove the `pop()` and just do `y = df['diagnosis'].to_frame()`? – stackoverflowuser2010 Mar 12 '20 at 07:17
@stackoverflowuser2010 `df[ ['diagnosis'] ]` pass a list of columns, it's easier. – cs95 Mar 12 '20 at 07:41
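
The difference discussed in the last few comments can be checked directly on the question's DataFrame. This is only an illustration of the shapes involved: indexing with a list of column names returns a single-column DataFrame and, unlike `pop()`, leaves the original DataFrame untouched. The variable names other than `medicalData` are made up for the example.

```python
import pandas as pd

medicalData = pd.DataFrame({'diagnosis': ['positive', 'positive', 'negative', 'negative', 'positive',
                                          'negative', 'negative', 'negative', 'negative', 'negative']})

y_series = medicalData['diagnosis']    # one label -> Series
y_frame = medicalData[['diagnosis']]   # list of labels -> single-column DataFrame
print(y_series.shape, y_frame.shape)

# pop() returns the column and removes it from the frame it is called on.
popped = medicalData.copy()
y_popped = popped.pop('diagnosis')
print(list(medicalData.columns), list(popped.columns))
```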

score 10 · Answer 2 · answered Jan 05 '21 at 00:28

Here is a Python function that splits a Pandas dataframe into train, validation, and test dataframes with stratified sampling. It performs this split by calling scikit-learn's function train_test_split() twice.

import pandas as pd
from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None):
    '''
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    df_input : Pandas dataframe
        Input dataframe to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    '''

    # Compare with a tolerance: float sums such as 0.7 + 0.2 + 0.1 are not exactly 1.0.
    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))

    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))

    X = df_input # Contains all columns.
    y = df_input[[stratify_colname]] # Dataframe of just the column on which to stratify.

    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test
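
The second call needs a rescaled `test_size` because it operates on the temp dataframe rather than on the full input. Here is a small worked check of that arithmetic using the function's default fractions; the variable names `rel`, `overall_val` and `overall_test` are only illustrative.

```python
import math

frac_train, frac_val, frac_test = 0.6, 0.15, 0.25

# Share of the temp data (everything that is not train) that goes to test.
rel = frac_test / (frac_val + frac_test)

# Multiplying back by the size of the temp data recovers the overall shares.
frac_temp = 1.0 - frac_train
overall_test = frac_temp * rel
overall_val = frac_temp * (1.0 - rel)
print(rel, overall_val, overall_test)
```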

Below is a complete working example.

Consider a dataset that has a label upon which you want to perform the stratification. This label has its own distribution in the original dataset, say 75% foo, 15% bar and 10% baz. Now let's split the dataset into train, validation, and test subsets using a 60/20/20 ratio, where each split retains the same distribution of the labels.

Here is the example dataset:

df = pd.DataFrame( { 'A': list(range(0, 100)),
                     'B': list(range(100, 0, -1)),
                     'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10 } )

df.head()
#    A    B label
# 0  0  100   foo
# 1  1   99   foo
# 2  2   98   foo
# 3  3   97   foo
# 4  4   96   foo

df.shape
# (100, 3)

df.label.value_counts()
# foo    75
# bar    15
# baz    10
# Name: label, dtype: int64

Now, let's call the split_stratified_into_train_val_test() function from above to get train, validation, and test dataframes following a 60/20/20 ratio.

df_train, df_val, df_test = \
    split_stratified_into_train_val_test(df, stratify_colname='label', frac_train=0.60, frac_val=0.20, frac_test=0.20)

The three dataframes df_train, df_val, and df_test contain all the original rows but their sizes will follow the above ratio.

df_train.shape
#(60, 3)

df_val.shape
#(20, 3)

df_test.shape
#(20, 3)

Further, each of the three splits will have the same distribution of the label, namely 75% foo, 15% bar and 10% baz.

df_train.label.value_counts()
# foo    45
# bar     9
# baz     6
# Name: label, dtype: int64

df_val.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64

df_test.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64
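
Comparing relative frequencies rather than raw counts makes the check easier to read. The sketch below uses a hypothetical helper, `label_shares`, and small stand-in frames with the same label counts as the train and validation splits shown above, so that it runs on its own.

```python
import pandas as pd

def label_shares(frames, label_col):
    """Return one column of relative label frequencies per named frame."""
    return pd.DataFrame({
        name: frame[label_col].value_counts(normalize=True)
        for name, frame in frames.items()
    }).fillna(0.0)  # a label missing from one frame has share 0

# Stand-in frames with the label counts of the splits above.
train = pd.DataFrame({'label': ['foo'] * 45 + ['bar'] * 9 + ['baz'] * 6})
val = pd.DataFrame({'label': ['foo'] * 15 + ['bar'] * 3 + ['baz'] * 2})

shares = label_shares({'train': train, 'val': val}, 'label')
print(shares)
```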

This function returns the following error: ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2. Output of value_counts(): Skyra_0 105, Skyra_2 37, Skyra_1 29, Skyra_3 18, TrioTim_4 7, Skyra_4 5, TrioTim_2 5, TrioTim_5 5, Skyra_5 5, TrioTim_3 3 – fabio.geraci, Oct 19 '21 at 08:35
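
The error quoted in this comment is raised by scikit-learn when a class in the stratification column has fewer than two rows in the data being split. Because the function splits twice, a class with only a few rows (such as TrioTim_3 with 3) can pass the first split and still be left with a single row for the second. A minimal sketch for spotting such labels beforehand follows; the helper `find_rare_labels` and the threshold `min_count=10` are assumptions for illustration, not part of scikit-learn.

```python
import pandas as pd

def find_rare_labels(labels, min_count=10):
    """Return the counts of all labels that occur fewer than min_count times."""
    counts = labels.value_counts()
    return counts[counts < min_count]

# Rebuild a label column from the counts quoted in the comment.
counts = {'Skyra_0': 105, 'Skyra_2': 37, 'Skyra_1': 29, 'Skyra_3': 18,
          'TrioTim_4': 7, 'Skyra_4': 5, 'TrioTim_2': 5, 'TrioTim_5': 5,
          'Skyra_5': 5, 'TrioTim_3': 3}
labels = pd.Series([name for name, n in counts.items() for _ in range(n)])

rare = find_rare_labels(labels, min_count=10)
print(rare)
```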

Lindafr · Answer 3 · 2023-05-29T11:29:47.193

Building on @stackoverflowuser2010's answer, I added a dictionary that assigns manual split sizes to the less frequent labels (fewer than 10 examples), which otherwise caused an error in the function. Its format is {amount_of_examples: [train_length, val, test]}. Here is the outcome:

import pandas as pd
from sklearn.model_selection import train_test_split
def split_stratified_into_train_val_test(df_input, stratify_tuples_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None,
                                         ratio_dict={3: [1, 1, 1], 4: [2, 1, 1], 5: [2, 2, 1],
                                                     6: [2, 2, 2], 7: [3, 2, 2], 8: [4, 2, 2],
                                                     9: [5, 2, 2]}):
    '''
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    df_input : Pandas dataframe
        Input dataframe to be split.
    stratify_tuples_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().
    ratio_dict : dict
        Manual split sizes for rare labels, in the form
        {amount_of_examples: [train_length, dev, test]}.

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    '''
    
    # checks
    # Compare with a tolerance: float sums such as 0.7 + 0.2 + 0.1 are not exactly 1.0.
    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))

    if stratify_tuples_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_tuples_colname))
    
    # Count how often each label occurs (on df_input, not on a global df).
    label_counts = df_input[stratify_tuples_colname].value_counts()
    # Labels with fewer than 10 occurrences are too rare for the train_test_split logic.
    # Set them aside to deal with them later, without adding a helper column to the input.
    is_frequent_enough = df_input[stratify_tuples_colname].map(label_counts) >= 10
    rare_labels_df = df_input[~is_frequent_enough]
    df_input = df_input[is_frequent_enough]

    X = df_input # Contains all columns.
    y = df_input[[stratify_tuples_colname]] # Dataframe of just the column on which to stratify.

    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)
    
    #Add rare_labels_df into the sets manually
    rare_labels = rare_labels_df[stratify_tuples_colname].unique()
    
    for rare_label in rare_labels:
        mini_df = rare_labels_df[rare_labels_df[stratify_tuples_colname] == rare_label].copy()
        mini_df_len = len(mini_df)
        if mini_df_len <= 2:  # too few examples to put at least one in every set, so skip this label
            continue
        dev_test = mini_df.tail(ratio_dict[len(mini_df)][1]+ ratio_dict[len(mini_df)][2])
        train = mini_df.drop(dev_test.index)
        test = dev_test.tail(ratio_dict[len(mini_df)][2])
        dev = dev_test.drop(test.index)
        assert mini_df_len == len(train) + len(dev) + len(test)
        df_val = pd.concat([df_val, dev])
        df_train = pd.concat([df_train, train])
        df_test = pd.concat([df_test, test])

    #assert len(df_input)+len(rare_labels_df) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test
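
The handling of one rare label in the loop above can be looked at in isolation. The sketch below uses plain positional slicing, which should give the same three pieces as the `tail()`/`drop()` sequence whenever the three sizes add up to the group size. The helper name `split_rare_group` is hypothetical, and, as in the function above, the rows are taken in their existing order without shuffling.

```python
import pandas as pd

def split_rare_group(mini_df, sizes):
    """Cut one rare-label group into (train, dev, test) by row order.

    sizes is [n_train, n_dev, n_test] and must sum to len(mini_df).
    """
    n_train, n_dev, n_test = sizes
    if n_train + n_dev + n_test != len(mini_df):
        raise ValueError('sizes must sum to the number of rows')
    train = mini_df.iloc[:n_train]
    dev = mini_df.iloc[n_train:n_train + n_dev]
    test = mini_df.iloc[n_train + n_dev:]
    return train, dev, test

# A label with 5 examples and the sizes [2, 2, 1] from ratio_dict.
mini_df = pd.DataFrame({'A': range(5), 'label': ['rare'] * 5})
train, dev, test = split_rare_group(mini_df, [2, 2, 1])
print(len(train), len(dev), len(test))
```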

Tom Hale · Answer 4 · 2019-07-29T01:31:41.110

-1

Pure `pandas` solution

To split into train / validation / test in the ratio 70 / 20 / 10%:

random_seed = 42  # any fixed integer; it only makes the split reproducible

train_df = df.sample(frac=0.7, random_state=random_seed)
tmp_df = df.drop(train_df.index)
# test is 10% of the whole, i.e. one third of the remaining 30%
test_df = tmp_df.sample(frac=1/3, random_state=random_seed)
valid_df = tmp_df.drop(test_df.index)

assert len(df) == len(train_df) + len(valid_df) + len(test_df), "Dataset sizes don't add up"
del tmp_df

edited Jul 29 '19 at 01:31

answered Jul 29 '19 at 01:26 · Tom Hale

This doesn't get at the OP request of stratified splits. – John Stud Jul 21 '20 at 03:42
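
As the comment points out, sampling from the whole frame does not stratify. A possible pandas-only variant is to sample within each label group instead. The sketch below is an assumption-laden illustration rather than the answer's method: the function name `pandas_stratified_split` and its parameters are hypothetical, it relies on `DataFrameGroupBy.sample` (available in pandas 1.1 and later), and it assumes a unique index so that `drop()` removes exactly the sampled rows.

```python
import pandas as pd

def pandas_stratified_split(df, label_col, frac_train=0.7, frac_valid=0.2,
                            random_state=None):
    """Stratified train/valid/test split using only pandas."""
    # Draw the training share from every label group separately.
    train_df = df.groupby(label_col).sample(frac=frac_train, random_state=random_state)
    rest_df = df.drop(train_df.index)
    # The validation share must be rescaled to the rows that are left.
    rel_valid = frac_valid / (1.0 - frac_train)
    valid_df = rest_df.groupby(label_col).sample(frac=rel_valid, random_state=random_state)
    test_df = rest_df.drop(valid_df.index)
    return train_df, valid_df, test_df

# Usage with a 20/80 class balance and a 70 / 20 / 10 split.
demo = pd.DataFrame({'diagnosis': ['positive'] * 20 + ['negative'] * 80})
train_df, valid_df, test_df = pandas_stratified_split(demo, 'diagnosis', random_state=0)
print(len(train_df), len(valid_df), len(test_df))
```

Group sizes are rounded to whole rows, so the class shares in each part are only approximately equal when groups are small.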
