How to perform k-fold cross validation with tensorflow?

Question

I am following the IRIS example of tensorflow.

All of my data is in a single CSV file, not split into training and test sets, and I want to apply k-fold cross-validation to it.

I have

data_set = tf.contrib.learn.datasets.base.load_csv(filename="mydata.csv",
                                                   target_dtype=np.int)

How can I perform k-fold cross-validation on this dataset with a multi-layer neural network, as in the IRIS example?

score 39 · Answer 1 · answered May 10 '18 at 12:59

I know this question is old but in case someone is looking to do something similar, expanding on ahmedhosny's answer:

The new TensorFlow datasets API can create dataset objects from Python generators, so together with scikit-learn's KFold, one option is to create a dataset from the KFold.split() generator:

import numpy as np

from sklearn.model_selection import KFold

import tensorflow as tf
import tensorflow.contrib.eager as tfe
tf.enable_eager_execution()

from sklearn.datasets import load_iris
data = load_iris()
X=data['data']
y=data['target']

def make_dataset(X_data,y_data,n_splits):

    def gen():
        for train_index, test_index in KFold(n_splits).split(X_data):
            X_train, X_test = X_data[train_index], X_data[test_index]
            y_train, y_test = y_data[train_index], y_data[test_index]
            yield X_train,y_train,X_test,y_test

    return tf.data.Dataset.from_generator(gen, (tf.float64,tf.float64,tf.float64,tf.float64))

dataset=make_dataset(X,y,10)

One can then iterate through the dataset either in graph-based TensorFlow or using eager execution. Using eager execution:

for X_train,y_train,X_test,y_test in tfe.Iterator(dataset):
    ....
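
On TensorFlow 2.x, tf.contrib no longer exists and eager execution is the default, so the snippet above needs adapting. The following is only a minimal sketch, assuming TensorFlow 2.x with Keras and scikit-learn are installed. The `build_model` helper, the layer sizes, and the epoch count are illustrative choices, not part of the original answer. It passes the KFold indices straight to a fresh model per fold instead of going through from_generator. Since the iris rows are ordered by class, the sketch also shuffles before splitting:

```python
import numpy as np
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

def build_model():
    # Hypothetical small multi-layer network; sizes are arbitrary.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X):
    # A new model per fold, so no weights leak between folds.
    model = build_model()
    model.fit(X[train_idx], y[train_idx], epochs=20, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    scores.append(acc)

print("mean accuracy over folds:", np.mean(scores))
```

The per-fold accuracies in `scores` will vary from run to run unless the TensorFlow random seed is fixed as well.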

What if `X` and `y` can not be held in-memory as is assumed by this snippet? I thought the whole point of using a generator was to load samples on-demand rather than load the entire dataset into memory. — fabiomaia, Dec 29 '18 at 17:23
@fabiomaia The same technique can be used to load them on-demand. For example, `X` could represent a list of filenames and in the for loop you load the files contents on-demand. — gw0, Jan 31 '20 at 15:41
@gw0 It does not work for large datasets (images). The loop still uses a lot of memory: whether you pre-load all the data and then split, or split and then load on demand, memory usage is the same. I tried both, and the program crashed from excessive memory usage. I solved it by passing the image file paths and, in each fold, creating a dataset from the split indices for training and validation (test). Now it works without excessive memory usage. — NelsonPunch, Oct 25 '21 at 02:54
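
The approach described in the last comment (split the file paths, then load per fold) might look roughly like the sketch below. It is not NelsonPunch's actual code: the file names, the `load_image` placeholder, and the `batches` helper are all hypothetical, and the loader returns a dummy array so that the example runs without any image files.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical file list; in practice these would be real image paths.
paths = np.array([f"images/img_{i:03d}.png" for i in range(10)])
labels = np.array([i % 2 for i in range(10)])

def load_image(path):
    # Placeholder loader (an assumption): replace with real decoding.
    # Here it just returns a dummy array of a fixed shape.
    return np.zeros((4, 4, 3), dtype=np.float32)

def batches(file_paths, file_labels, batch_size):
    # Load one batch of images at a time, so memory use is bounded
    # by batch_size rather than by the size of the fold.
    for start in range(0, len(file_paths), batch_size):
        chunk = file_paths[start:start + batch_size]
        images = np.stack([load_image(p) for p in chunk])
        yield images, file_labels[start:start + batch_size]

fold_sizes = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(paths):
    # Only paths and labels are split here; no pixel data is held per fold.
    n_train = sum(len(x) for x, _ in batches(paths[train_idx], labels[train_idx], 4))
    n_val = sum(len(x) for x, _ in batches(paths[val_idx], labels[val_idx], 4))
    fold_sizes.append((n_train, n_val))

print(fold_sizes)
```

In a real training loop, the batches would be fed to the model instead of being counted.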

score 14 · Answer 2 · ahmedhosny · 2017-05-03T15:20:54.813

Neural networks are usually used with large datasets, where cross-validation is rarely used because it is very expensive. In the case of IRIS (50 samples per species), you probably do need it. Why not use scikit-learn with different random seeds to split your data into training and test sets?

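from sklearn.model_selection import train_test_split

# k = number of repetitions; X, y = your features and labels
for seed in range(k):
    # split the data differently by passing a different value to random_state
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=seed)
    # train the net on X_train, y_train
    # test it on X_test, y_test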

If you don't like relying on the random seed and want a more structured k-fold split, you can use this, taken from here.

from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "b", "c", "c", "c"]
k_fold = KFold(n_splits=3)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))

Output:

Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]

edited May 03 '17 at 15:20

answered Nov 20 '16 at 10:42


score 20 · The answer is not related to the question! It should provide a TensorFlow solution. – AGP Aug 19 '18 at 16:53
score 4 · Since the answer offers a solution that is usable with TensorFlow, I cannot see the problem. – mrk Dec 19 '18 at 22:40
How can we make this even more randomized? – Mona Jalal Mar 30 '19 at 01:57
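
On the last question about making the split more randomized: one option is KFold's `shuffle` parameter, which permutes the samples before the folds are cut. StratifiedKFold additionally keeps the class proportions similar in every fold. A small sketch on toy data (the arrays and the seed value are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])

# shuffle=True permutes the sample order before the folds are cut;
# random_state makes that permutation reproducible.
shuffled = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, test_idx in shuffled.split(X):
    print("Train:", train_idx, "| test:", test_idx)

# StratifiedKFold needs the labels, and here puts one sample
# of each class into every test fold.
stratified = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
strat_tests = [test_idx for _, test_idx in stratified.split(X, y)]
for test_idx in strat_tests:
    print("test labels:", y[test_idx])
```

Passing a different `random_state` gives a different but still non-overlapping set of folds.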

score 0 · Answer 3 · answered Aug 03 '21 at 22:26

Modifying @ahmedhosny's answer to store the indices of every fold:

from sklearn.model_selection import KFold, cross_val_score
k_fold = KFold(n_splits=k)
train_ = []
test_ = []
for train_indices, test_indices in k_fold.split(all_data.index):
    train_.append(train_indices)
    test_.append(test_indices)
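
The answer does not show what `all_data` is or how the stored indices are used afterwards. As a sketch, assuming `all_data` is a pandas DataFrame (the toy frame below and the value of `k` are made up), the folds could be materialized like this. KFold yields integer positions rather than index labels, so the rows are selected with `.iloc`:

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for all_data, with a non-integer index.
all_data = pd.DataFrame({"feature": range(10, 20), "label": [0, 1] * 5},
                        index=list("abcdefghij"))
k = 5

k_fold = KFold(n_splits=k)
train_, test_ = [], []
for train_indices, test_indices in k_fold.split(all_data.index):
    train_.append(train_indices)
    test_.append(test_indices)

# Positions, not labels: use .iloc (using .loc would fail with this index).
for fold in range(k):
    train_df = all_data.iloc[train_[fold]]
    test_df = all_data.iloc[test_[fold]]
    print(fold, len(train_df), len(test_df))
```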
