Cross-validation of a dataset split across files

Question

My dataset is split across several files. Each file groups samples that are related to each other, i.e., they were created under similar conditions at a similar time. The balance of the train/test split is important, so all the samples in a file have to go to either train or test; a file cannot be split between the two. This makes KFold hard to use directly in my scikit-learn code.

Right now I am doing something similar to LOO (leave-one-out), along these lines:

train ~> cat ./dataset/!(1.txt)
test ~> cat ./dataset/1.txt

This is not convenient, and not very useful if I want test folds made of several files and a "real" CV. How can I set up a proper CV to check for real overfitting?

score 0 · Accepted Answer · edited May 23 '17 at 11:52

After looking at this answer, I realized that pandas can concatenate DataFrames. I measured the process as 15-20% slower than the command-line cat, but it lets me build the folds I was expecting.

Anyway, I am quite sure there must be a better way than this one:

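import glob
import pandas as pd
# model_selection replaces sklearn.cross_validation, which was removed in scikit-learn 0.20
from sklearn.model_selection import KFold

# Each file is one unit, so the folds are built over files rather than rows.
allFiles = glob.glob("./dataset/*.txt")
kf = KFold(n_splits=3, shuffle=True)

for train_files, cv_files in kf.split(allFiles):
    # train_files and cv_files are index arrays into allFiles
    dataTrain = pd.concat(pd.read_csv(allFiles[idTrain], header=None) for idTrain in train_files)
    dataTest = pd.concat(pd.read_csv(allFiles[idTest], header=None) for idTest in cv_files)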
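
As a possible alternative, recent scikit-learn versions ship GroupKFold, which keeps all rows that share a group label on the same side of each split. The sketch below is not what I actually ran. It assumes the files are loaded once into a single DataFrame, and it uses synthetic in-memory frames plus a hypothetical file_id column in place of the real ./dataset/*.txt files:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical stand-in for the per-file data: six "files" with five rows each.
rng = np.random.default_rng(0)
frames = []
for file_id in range(6):
    frame = pd.DataFrame(rng.normal(size=(5, 3)))
    frame["file_id"] = file_id  # remember which file each row came from
    frames.append(frame)
data = pd.concat(frames, ignore_index=True)

groups = data["file_id"].to_numpy()
X = data.drop(columns="file_id").to_numpy()

# GroupKFold never places rows with the same group label on both sides of a split.
gkf = GroupKFold(n_splits=3)
folds = []
for train_idx, test_idx in gkf.split(X, groups=groups):
    train_files = {int(g) for g in groups[train_idx]}
    test_files = {int(g) for g in groups[test_idx]}
    folds.append((train_files, test_files))
    print("test files:", sorted(test_files))
```

With real data, each frame would come from pd.read_csv on one file, and the index arrays could be passed straight to an estimator, so every file is read only once instead of once per fold.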
