540

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.

Thanks!

tooty44

30 Answers

952

Scikit Learn's train_test_split is a good one. It will split both numpy arrays and dataframes.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
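
For completeness, a common pattern (a sketch; the `target` column name and the `random_state` value are placeholders, not from the question) is to split features and labels in the same call so the rows stay aligned:

from sklearn.model_selection import train_test_split

# 'target' is a placeholder for your label column; random_state makes the split reproducible
X = df.drop(columns=['target'])
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

When DataFrames go in, DataFrames (and Series) come back out.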
o-90
  • This will return numpy arrays and not Pandas Dataframes however – Bar Oct 22 '14 at 15:10
  • Btw, it does return a Pandas Dataframe now (just tested on Sklearn 0.16.1) – Julien Marrec Jul 08 '15 at 10:30
  • If you're looking for KFold, its a bit more complex sadly. `kf = KFold(n, n_folds=folds) for train_index, test_index in kf: X_train, X_test = X.ix[train_index], X.ix[test_index]` see full example here: https://www.quantstart.com/articles/Using-Cross-Validation-to-Optimise-a-Machine-Learning-Method-The-Regression-Setting – ihadanny Feb 23 '16 at 13:13
  • In new versions (0.18, maybe earlier), import as `from sklearn.model_selection import train_test_split` instead. – Mark Oct 19 '16 at 17:24
  • See official docs [here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) – Noel Evans Nov 07 '16 at 21:08
  • In the newest SciKit version you need to call it now as: `from sklearn.cross_validation import train_test_split` – horseshoe Mar 22 '17 at 09:32
  • @horseshoe the cv module is deprecated: `DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20. "This module will be removed in 0.20.", DeprecationWarning)` – Kingz Jul 19 '17 at 18:24
480

I would just use NumPy's `rand` to build a random boolean mask:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see this has worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
Andy Hayden
  • Since `msk` returns an array of bools, perhaps `df.iloc` should be `df.loc` lest True/False be treated as 1,0 indices. – unutbu Jun 10 '14 at 17:37
  • @unutbu hmmmmmm good point, I was thinking the same about the loc ambiguity (if they are labelled with 0 or 1... maybe best not to use at all? – Andy Hayden Jun 10 '14 at 17:51
  • Sorry, my mistake. As long as `msk` is of dtype `bool`, `df[msk]`, `df.iloc[msk]` and `df.loc[msk]` always return the same result. – unutbu Jun 10 '14 at 18:32
  • I think you should use `rand` to `< 0.8` make sense because it returns uniformly distributed random numbers between 0 and 1. – R. Max Jun 10 '14 at 18:43
  • @AndyHayden, in your example, if I change 0.8 to 0.2 I get `len(train)` equal to 59 and `len(test)` equal to 41. – R. Max Jun 10 '14 at 23:51
  • Thanks for the response, but would you know of any way to split into train and test samples without converting the dataframe to a numpy array? Currently, my code for binning the data requires input in the form of a dataframe. Thanks! – tooty44 Jun 11 '14 at 13:22
  • @user3712008: this doesn't convert anything into a numpy array, but rather uses numpy to create the mask for indices. the construction of the dataframe at the start uses numpy as well, but that is incidental – watsonic Jan 20 '16 at 19:38
  • You'll run into problems if you want to stratify your selections, rather than just split at random. It looks like you can now do that as well as pass dataframes to the sklearn train_test_split function: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html – berto77 Sep 14 '16 at 20:51
  • Can someone explain purely in python terms what exactly happens in lines `in[12]`, `in[13]`, `in[14]`? I want to understand the python code itself here – kuatroka May 15 '17 at 17:04
  • `np.random.rand(N)` creates a numpy array of random floats between 0 and 1, so whether the value is < 0.8 classifies them into 80% and 20%. The ~ is the invert operator (so True becomes False, False becomes True for boolean), df[booleans] filters the DataFrame to just have the true values. – Andy Hayden May 15 '17 at 18:05
  • The answer using **sklearn** from *gobrewers14* is the better one. It's less complex and easier to debug. I recommend using the answer below. – So S Oct 02 '17 at 15:51
  • Why use `np.random.randn` for `msk`? Wouldn't `np.random.uniform` be a better idea? – iamwhoiam May 20 '18 at 13:12
  • @kuatroka `np.random.rand(len(df))` is an array of size `len(df)` with randomly and uniformly distributed float values in range [0, 1]. The `< 0.8` applies the comparison element-wise and stores the result in place. Thus values < 0.8 become `True` and value >= 0.8 become `False` – Kentzo Dec 06 '18 at 00:40
  • A slightly different question: If I now have the dataframes `train` and `test`, what is the best way to call Tensorflow's `model.fit()` method? Is there a way, to directly use the dataframes or do I need to convert them to Numpy arrays first? – JavAlex Jan 03 '21 at 11:57
  • I suggest avoiding variable naming conventions like `msk`. That's not an industry standard abbreviation for "mask" and eliminating time spent typing that vowel doesn't offset the loss of readability of the code in general, particularly for non-native speakers who might read that and be confused. – rileymcdowell Jun 18 '21 at 16:17
  • Better use sklearn, since you can use shuffle=False for time series data. – Petar Ulev Sep 23 '22 at 14:14
419

Pandas random sample will also work

train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)

For the same random_state value you will always get the same exact data in the training and test set. This brings in some level of repeatability while also randomly separating training and test data.
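
As a quick sanity check (a minimal sketch), you can verify that the two pieces are disjoint and together cover the whole DataFrame:

# the two index sets should not overlap and should add back up to the original
assert len(train) + len(test) == len(df)
assert train.index.intersection(test.index).empty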

RajV
PagMax
  • What does .index mean / where is the documentation for .index on a DataFrame? I can't find it. – dmonopoly Feb 13 '17 at 16:47
  • @dmonopoly, it is exactly what it looks like. df.index retruns index object of that dataframe. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html#pandas.Index also some discussion at http://stackoverflow.com/questions/17241004/pandas-how-to-get-the-data-frame-index-as-an-array – PagMax Feb 14 '17 at 03:28
  • what is `random_state` arg doing? – Rishabh Agrahari Nov 01 '17 at 12:42
  • @RishabhAgrahari randomly shuffles different data split every time according to the frac arg. If you want to control the randomness you can state your own seed, like in the example. – MikeL Nov 15 '17 at 09:32
  • This seems to work well and a more elegant solution than bringing in sklearn. Is there a reason why this shouldn't be a better accepted answer? – RajV Aug 07 '19 at 15:03
  • @RajV in its current form `test` will be randomly selected but rows will be in their original order. The sklearn approach shuffles both train and test. – peer Aug 28 '19 at 10:34
  • Better solution. scikit learn doesn't take raw data frames, rather it expects arrays. – Cybernetic Nov 08 '19 at 20:21
  • @peer that limitation is easily remedied if a shuffled `test` set is desired as pointed out here https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows. `test=df.drop(train.index).sample(frac=1.0)` – Alok Lal Dec 05 '19 at 21:38
  • @Cybernetic, the current version of sklearn does take data frames but returns arrays. – Denis Kazakov Nov 17 '22 at 12:23
  • @RajV, one difference of pandas sample is that it returns a data frame with column names preserved, unlike sklearn, which may be an advantage. – Denis Kazakov Nov 17 '22 at 12:24
42

I would use scikit-learn's own train_test_split, and generate the split from the index:

from sklearn.model_selection import train_test_split


y = df.pop('output')
X = df

X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
X.iloc[X_train]  # returns the training dataframe (this relies on the default RangeIndex; use X.loc for label-based indices)
Napitupulu Jon
  • The `cross_validation` module is now deprecated: `DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.` – Harry Nov 05 '16 at 23:23
  • This gives an error when I do it with a `df` whose `output` column is strings. I get `TypeError: '<' not supported between instances of 'str' and 'float'`. It appears that `y` needs to be a `DataFrame` not a `Series`. Indeed, appending `.to_frame()` either the definition of `y` or the argument `y` in `train_test_split` works. If you're using `stratify = y`, you need to make sure that this `y` is a `DataFrame` too. If I instead define `y = df[["output"]]` and `X = df.drop("output", axis = 1)` then it works too; this is basically the same as appending `.to_frame()` to the definition of `y`. – Sam OT Apr 29 '21 at 08:44
33

No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

And if you want to split x from y

X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)

And if you just want to separate the whole df into X and y

X, y = df[list_of_x_cols], df[y_col]
Nosey
27

There are many ways to create train/test and even validation samples.

Case 1: the classic way, train_test_split without any options:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)

Case 2: for very small datasets (<500 rows), use cross-validation in order to get predictions for all of your rows. At the end, you will have one prediction for each line of your available training set.

from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor

kf = KFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)
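
As a side note (a sketch, not part of the original answer), scikit-learn's cross_val_predict produces the same kind of out-of-fold predictions in a single call:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

# one out-of-fold prediction per row of the training data
kf = KFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = cross_val_predict(RandomForestRegressor(n_estimators=50, random_state=0), X, y, cv=kf)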

Case 3a: unbalanced datasets for classification purposes. Following case 1, here is the equivalent solution:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)

Case 3b: unbalanced datasets for classification purposes. Following case 2, here is the equivalent solution:

from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)

Case 4: you need to create train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).

from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.4)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, test_size=0.5)
double-beep
yannick_leo
15

You can use the code below to create test and train samples:

from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)

Test size can vary depending on the percentage of data you want to put in your test and train dataset.

user1775015
9

There are many valid answers. Adding one more to the bunch:

# gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
# gets the left-out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]
Abhi
7

You may also consider stratified division into training and testing sets. Stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
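
Usage might look like this (a sketch; `'label'` stands in for whatever column holds your classes):

train_inds, test_inds = get_train_test_inds(df['label'], train_proportion=0.7)
train_df, test_df = df[train_inds], df[test_inds]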

Apogentus
6

You can use ~ (the tilde operator) to exclude the rows sampled with df.sample(), letting pandas alone handle the sampling and index filtering, to obtain the two sets.

train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]
4

If you need to split your data with respect to the labels column in your data set, you can use this:

import pandas as pd

def split_to_train_test(df, label_column, train_frac=0.8):
    train_df, test_df = pd.DataFrame(), pd.DataFrame()
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        train_df = pd.concat([train_df, lbl_train_df])
        test_df = pd.concat([test_df, lbl_test_df])

    return train_df, test_df

and use it:

train, test = split_to_train_test(data, 'class', 0.7)

you can also pass random_state if you want to control the split randomness or use some global random seed.

MikeL
4

To split into more than two sets, such as train, test, and validation, one can do:

probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs>=0.7) & (probs < 0.85)
validation_mask = probs >= 0.85


df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validation_mask]

This will put approximately 70% of data in training, 15% in test, and 15% in validation.
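
If you need exact sizes rather than approximate ones, here is a minimal sketch using a shuffled positional index and np.split (the 70/15/15 proportions are just the ones from this answer):

import numpy as np

shuffled = np.random.permutation(len(df))
n_train = int(0.70 * len(df))
n_test = int(0.15 * len(df))

# np.split cuts the shuffled positions at the given boundaries
train_idx, test_idx, val_idx = np.split(shuffled, [n_train, n_train + n_test])

df_training = df.iloc[train_idx]
df_test = df.iloc[test_idx]
df_validation = df.iloc[val_idx]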

AHonarmand
  • You might want to edit your answer to add "approximately", if you run the code you will see that it can be quite off from the exact percentage. e.g. I tried it on 1000 items and got: 700, 141, 159 - so 70%, 14% and 16%. – stason Feb 05 '20 at 21:06
4
import numpy as np

# shuffled row positions 0..len(df)-1
shuffle = np.random.permutation(len(df))
# desired size of the test set (20% of the rows)
test_size = int(len(df) * 0.2)
# the first 20% of the shuffled positions become the test set, the rest the training set
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF = df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]
elyte5star
  • This would be a better answer if you explained how the code you provided answers the question. – pppery Jun 17 '20 at 20:46
  • While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. – shaunakde Jun 17 '20 at 21:31
  • the first line returns a shuffled range(with respect to the size of the dataframe).The second line represents the desired fraction of the test set.The third and forth line incorporates the fraction into the shuffled range.The rest lines should be self explanatory.Regards. – elyte5star Jun 17 '20 at 21:38
  • Adding this explanation to the answer itself will be optimal :) – Sheece Gardazi Apr 22 '21 at 01:54
3

Just select a range of rows from df like this:

row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
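
If the rows might not already be in random order (see the comments below), a minimal sketch that shuffles the DataFrame before slicing:

df = df.sample(frac=1, random_state=0).reset_index(drop=True)  # shuffle the rows first
row_count = df.shape[0]
split_point = int(row_count * 1 / 5)
test_data, train_data = df[:split_point], df[split_point:]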
Liran Orevi
Makio
  • This would only work if the data in the dataframe is already randomly ordered. If the dataset is derived from ultiple sources and has been appended to the same dataframe then it's quite possible to get a very skewed dataset for training/testing using the above. – Emil L May 12 '17 at 08:17
  • You can shuffle dataframe before split it http://stackoverflow.com/questions/29576430/shuffle-dataframe-rows – Makio May 12 '17 at 08:59
  • Absolutelty! If you add that `df` in your code snippet is (or should be) shuffled it will improve the answer. – Emil L May 12 '17 at 13:55
3
import pandas as pd
from sklearn.model_selection import train_test_split

datafile_name = 'path_to_data_file'
data = pd.read_csv(datafile_name)

# separate the target column from the features before splitting
target_attribute = data['column_name']
data = data.drop(columns=['column_name'])

X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.2)
Pardhu Gopalam
  • You have a short mistake. You should drop target column before, you put it into train_test_split. data = data.drop(columns = ['column_name'], axis = 1) – Anton Erjomin Aug 06 '18 at 14:35
2

This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).

def make_sets(data_df, test_portion):
    import random as rnd

    tot_ix = range(len(data_df))
    test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    train_ix = list(set(tot_ix) ^ set(test_ix))

    test_df = data_df.iloc[test_ix]
    train_df = data_df.iloc[train_ix]

    return train_df, test_df


train_df, test_df = make_sets(data_df, 0.2)
test_df.head()
Anarcho-Chossid
2

There are many great answers above, so I just want to add one more example for the case where you want to specify the exact number of samples for the train and test sets, using just the numpy library.

# set the random seed for the reproducibility
np.random.seed(17)

# e.g. number of samples for the training set is 1000
n_train = 1000

# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)

# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]

train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]

test_data = data_df.iloc[test_ids]
test_labels = labels_df.iloc[test_ids]
biendltb
2

If you want to split it into train, test and validation sets, you can use this function:

from sklearn.model_selection import train_test_split
import pandas as pd

def train_test_val_split(df, test_size=0.15, val_size=0.45):
    temp, test = train_test_split(df, test_size=test_size)
    total_items_count = len(df.index)
    val_length = total_items_count * val_size
    new_val_proportion = val_length / len(temp.index)
    train, val = train_test_split(temp, test_size=new_val_proportion)
    return train, test, val
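
Usage (a sketch), with a quick check that no rows were lost:

train, test, val = train_test_val_split(df, test_size=0.15, val_size=0.45)
assert len(train) + len(test) + len(val) == len(df)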
otto
1

If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:

import numpy as np

def split_data(df, train_perc=0.8):
    df['train'] = np.random.rand(len(df)) < train_perc
    train = df[df.train == 1]
    test = df[df.train == 0]
    split_data = {'train': train, 'test': test}
    return split_data
Johnny V
1

I think you also need to get a copy, not a slice, of the dataframe if you want to add columns later.

msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
Hakim
1

You can convert the DataFrame to a NumPy array (df.values, or df.as_matrix() in older pandas versions) and pass that in.

from sklearn.model_selection import train_test_split

Y = df.pop('target_column_name')  # pop() takes the name of the target column ('target_column_name' is a placeholder)
X = df.values                     # df.as_matrix() in older pandas versions
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)
model.test(x_test)
kiran6
1

A bit more elegant to my taste is to create a random column and then split by it; this way we can get a split that suits our needs and is random.

import numpy as np

def split_df(df, p=[0.8, 0.2]):
    df["rand"] = np.random.choice(len(p), len(df), p=p)
    r = [df[df["rand"] == val] for val in df["rand"].unique()]
    return r
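
One caveat worth stating: the order of the returned frames follows the order in which the random values happen to appear in the column, so it is safer to check sizes than to assume which piece is which. A usage sketch:

parts = split_df(df, p=[0.8, 0.2])
train, test = sorted(parts, key=len, reverse=True)  # treat the larger piece as train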
thebeancounter
1

Read the data into a pandas dataframe, split it with train_test_split (which returns dataframes when given a dataframe), and write each part back to CSV:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')

train, test = train_test_split(df, test_size=0.2)

train.to_csv('/content/drive/My Drive/train.csv', sep="\t", header=None, encoding='utf-8', index=False)
test.to_csv('/content/drive/My Drive/test.csv', sep="\t", header=None, encoding='utf-8', index=False)
Shaina Raza
1

In my case, I wanted to split a data frame into train, test and dev sets with specific row counts. Here I am sharing my solution.

First, assign a unique id to the dataframe (if one does not already exist):

import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]

Here are my split numbers:

train = 120765
test  = 4134
dev   = 2816

The split function

def df_split(df, n):
    
    first  = df.sample(n)
    second = df[~df.id.isin(list(first['id']))]
    first.reset_index(drop=True, inplace = True)
    second.reset_index(drop=True, inplace = True)
    return first, second

Now splitting into train, test, dev

train, test = df_split(df, 120765)
test, dev   = df_split(test, 4134)
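
A quick check (a sketch) that the three parts are disjoint and add back up to the original:

assert len(train) + len(test) + len(dev) == len(df)
assert len(set(train['id']) | set(test['id']) | set(dev['id'])) == len(df)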
Aaditya Ura
  • resetting index is important if you are using datasets and dataloaders or even otherwise it is a good convention. This is the only answer that talks of reindexing. – Allohvk Jun 30 '21 at 09:57
1

The sample method selects a random part of the data; passing a seed value (random_state) makes the selection reproducible.

train = df.sample(frac=0.8, random_state=42)

For the test set, drop the rows whose indexes appear in the train DF and then reset the index of the new DF.

test = df.drop(train.index).reset_index(drop=True)
  • Please read [answer] and [edit] your answer to contain an explanation as to why this code would actually solve the problem at hand. Always remember that you're not only solving the problem, but are also educating the OP and any future readers of this post. – Adriaan Nov 02 '22 at 06:37
  • I think it's self explanatory. OP asked for splitting df into train and test, which these two variables represents. I'll still read the linked doc though. Thanks – umair mughal Nov 02 '22 at 08:04
  • The mere fact that the OP asked about this shows they don't have a complete understanding of Pandas, which on its own is enough to merit an explanation as to why this works. – Adriaan Nov 02 '22 at 08:09
  • But that is a clone of an already existing and highly upvoted answer. Please, when answering to old questions, be sure to bring new information that was not present in previous answers (for example, because of technical changes since), and to explicitly make clear what is new. – chrslg Nov 07 '22 at 09:27
1

I do this in 2 ways.

Method 1:

from sklearn.model_selection import train_test_split
# split X and y into train and test sets (20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Method 2:

from sklearn.model_selection import train_test_split
# split X and y into train and test sets (80% train)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

Also for larger dataframes, please check out Intel® Distribution of Modin* instead of pandas (https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html#gs.1dtwen) and Intel® Extension for Scikit-learn* (https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html#gs.1dtvml). These framework optimizations will help to accelerate performance on Intel hardware.

Ramya R
0

How about this? df is my dataframe

import math

total_size = len(df)

train_size = math.floor(0.66 * total_size)  # 2/3 of my dataset

# training dataset
train = df.head(train_size)
# test dataset
test = df.tail(len(df) - train_size)
Akash Jain
0

I would use K-fold cross-validation. It generally gives a more reliable performance estimate than a single train_test_split. Here's an article on how to apply it with sklearn, from the documentation itself: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
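
A minimal sketch of what that looks like for a DataFrame (the fold count and random_state here are arbitrary choices, not from the linked docs):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(df):
    train, test = df.iloc[train_index], df.iloc[test_index]
    # fit and evaluate your model on each fold here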

0

Split df into train, validate, test. Given a df of augmented data, select only the dependent and independent columns. Assign 10% of most recent rows (using 'dates' column) to test_df. Randomly assign 10% of remaining rows to validate_df with rest being assigned to train_df. Do not reindex. Check that all rows are uniquely assigned. Use only native python and pandas libs.

Method 1: Split rows into train, validate, test dataframes.

train_df = augmented_df[dependent_and_independent_columns]
test_df = train_df.sort_values('dates').tail(int(len(augmented_df)*0.1)) # select latest 10% of dates for test data
train_df = train_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_df.sample(frac=0.1) # randomly assign 10%
train_df = train_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df

Method 2: Split rows when validate must be subset of train (fastai)

train_validate_test_df = augmented_df[dependent_and_independent_columns]
test_df = train_validate_test_df.loc[augmented_df.sort_values('dates').tail(int(len(augmented_df)*0.1)).index] # select latest 10% of dates for test data
train_validate_df = train_validate_test_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_validate_df.sample(frac=0.1) # assign 10% to validate_df
train_df = train_validate_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
# fastai example usage
dls = fastai.tabular.all.TabularDataLoaders.from_df(
train_validate_df, valid_idx=train_validate_df.index.get_indexer_for(validate_df.index))
BSalita
0

That's what I do:

train_dataset = dataset.sample(frac=0.80, random_state=200)
val_dataset = dataset.drop(train_dataset.index).sample(frac=1.00, random_state=200, ignore_index = True).copy()
train_dataset = train_dataset.sample(frac=1.00, random_state=200, ignore_index = True).copy()
del dataset
Nathan B