26

Let's say I've read in a text file using a TextLineReader. Is there some way to split this into train and test sets in TensorFlow? Something like:

def read_my_file_format(filename_queue):
  reader = tf.TextLineReader()
  key, record_string = reader.read(filename_queue)
  raw_features, label = tf.decode_csv(record_string)
  features = some_processing(raw_features)
  # tf.train_split is hypothetical; it only shows the kind of call I am looking for
  features_train, labels_train, features_test, labels_test = tf.train_split(features,
                                                                            label,
                                                                            frac=.1)
  return features_train, labels_train, features_test, labels_test
Luke
  • Related: https://stackoverflow.com/questions/54519309/split-tfrecords-file-into-many-tfrecords-files – xdhmoore Jan 22 '21 at 02:12

5 Answers

19

As elham mentioned, you can use scikit-learn to do this easily. scikit-learn is an open source library for machine learning. It includes many tools for data preparation, among them the model_selection module, which handles comparing, validating, and choosing parameters.

The model_selection.train_test_split() function is specifically designed to split your data randomly into train and test sets by a given proportion.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features,
                                                    labels,
                                                    test_size=0.33,
                                                    random_state=42)

test_size is the fraction of the data to reserve for testing (0.33 means 33%), and random_state seeds the random sampling so the split is reproducible.

I typically use this to produce train and validation data sets, and keep the true test data separate. You could also run train_test_split twice to do this: first split the data into (Train + Validation) and Test, then split Train + Validation into two separate sets.
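The two-stage split described above can be sketched as follows. This is a minimal illustration, not the answerer's own code: the toy `features` and `labels` arrays and the 60/20/20 proportions are assumptions chosen only to make the example runnable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 examples with 2 features each.
features = np.arange(200).reshape(100, 2)
labels = np.arange(100)

# First split: hold out 20% of all examples as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80 examples becomes the validation set,
# which leaves 60 examples for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
```

Note that the second test_size is relative to the data left after the first split, not to the original data set.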

Engineero
Jspies
  • 15
    Thanks, but this does not answer the question. I'm using a `TextLineReader` so the data is now a tensor. scikit-learn works on numpy arrays not tensorflow tensors. – Luke Apr 19 '17 at 22:29
  • 2
    Gotcha. I thought it should work with any python type that is enumerable. I'll have to give it a try. – Jspies Apr 20 '17 at 22:43
16

Something like the following should work: tf.split_v(tf.random_shuffle(...

Edit: for TensorFlow > 0.12, this should now be called as tf.split(tf.random.shuffle(...

Reference

See docs for tf.split and for tf.random.shuffle for examples.
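The answer leaves the arguments elided, so the following is only one possible way to fill them in, assuming TensorFlow 2.x with eager execution. Shuffling a tensor of row indices (rather than the data itself) is my own choice here, made so that features and labels stay aligned; the toy tensors and the split sizes are hypothetical.

```python
import tensorflow as tf

# Toy data (assumed for illustration): 10 examples with 2 features each.
num_examples = 10
num_test = 3
features = tf.reshape(tf.range(20, dtype=tf.float32), (num_examples, 2))
labels = tf.range(num_examples)

# Shuffle the row indices once, then cut them into a train part and a test part.
shuffled = tf.random.shuffle(tf.range(num_examples))
train_idx, test_idx = tf.split(shuffled, [num_examples - num_test, num_test])

# Gather with the same indices so features and labels stay paired.
features_train = tf.gather(features, train_idx)
labels_train = tf.gather(labels, train_idx)
features_test = tf.gather(features, test_idx)
labels_test = tf.gather(labels, test_idx)
```

This needs the whole data set in a single tensor, so it suits data that fits in memory rather than records streamed from a reader.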

Mykola Zotko
user1454804
7
import sklearn.model_selection as sk

X_train, X_test, y_train, y_test = sk.train_test_split(
    features, labels, test_size=0.33, random_state=42)
Brian
elham shawky
  • 9
    Whilst this code snippet is welcome, and may provide some help, it would be [greatly improved if it included an explanation](//meta.stackexchange.com/q/114762) of *how* and *why* this solves the problem. Remember that you are answering the question for readers in the future, not just the person asking now! Please [edit] your answer to add explanation, and give an indication of what limitations and assumptions apply. – Toby Speight Apr 03 '17 at 12:49
  • I agree that this answer needs explanation, but it is very helpful as it points the OP in the right direction. sklearn.model_selection provides great tools for splitting into train, validation, and test sets. You could "manually" split the data with tensorflow.split_v but sklearn will do it for you! – Jspies Apr 19 '17 at 14:05
  • To split data into train and test sets, use the train_test_split function from sklearn.model_selection. You need to choose the split fraction: test_size=0.33 means that 33% of the original data is used for testing and the remainder for training. The function returns four elements: the data and labels for the train and test sets. X denotes data and y denotes labels. – elham shawky Feb 23 '18 at 14:15
  • I guess it is better to do this at the end of pre-processing. Why do you need to do it with tensors? I am just curious. – chikitin Nov 10 '19 at 21:45
3

I managed to get a nice result using the map and filter functions of the tf.data.Dataset API. Use the map function to randomly assign each example to either train or test: for each example, draw a sample from a uniform distribution and check whether the sampled value falls below the split rate.

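import tensorflow as tf

def split_train_test(parsed_features, train_rate):
    # Draw an integer in [0, 100) and flag the example as train when the draw
    # falls below train_rate * 100 (tf.random.uniform replaces tf.random_uniform).
    draw = tf.random.uniform([], maxval=100, dtype=tf.int32)
    parsed_features['is_train'] = draw < tf.cast(train_rate * 100, tf.int32)
    return parsed_features

def grab_train_examples(parsed_features):
    # Predicate for Dataset.filter: keep examples flagged as train.
    return parsed_features['is_train']

def grab_test_examples(parsed_features):
    # Predicate for Dataset.filter: keep examples not flagged as train.
    return tf.logical_not(parsed_features['is_train'])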
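One thing to watch with a random draw inside map is that it may be re-evaluated every time the dataset is iterated, so an example could land in train on one pass and in test on another. The sketch below is a variation on the idea above, not the answerer's method: it assumes TensorFlow 2.x and swaps in tf.random.stateless_uniform seeded with each example's position, so the assignment is repeatable. The names `TRAIN_RATE`, `SEED`, and `is_train`, and the toy dataset, are hypothetical.

```python
import tensorflow as tf

TRAIN_RATE = 0.8  # assumed fraction of examples to use for training
SEED = 42         # assumed fixed seed so the assignment is repeatable

def is_train(index, example):
    # The draw depends only on (SEED, index), so a given example always
    # receives the same train/test assignment.
    seed = tf.stack([tf.constant(SEED, dtype=tf.int64), index])
    draw = tf.random.stateless_uniform([], seed=seed)
    return draw < TRAIN_RATE

# Toy dataset (assumed for illustration): the integers 0..99.
examples = tf.data.Dataset.range(100)
indexed = examples.enumerate()  # yields (index, example) pairs

train_ds = indexed.filter(is_train).map(lambda index, example: example)
test_ds = indexed.filter(
    lambda index, example: tf.logical_not(is_train(index, example))
).map(lambda index, example: example)
```

Because the two filters are exact complements, every example ends up in exactly one of train_ds and test_ds, although the realized train fraction is only approximately TRAIN_RATE.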
3

I've improvised a solution by wrapping the train_test_split function from sklearn so that it accepts tensors as input and returns tensors as well.

I'm new to TensorFlow and facing the same issue, so if you have a better solution that doesn't rely on a different package, I'd appreciate it.

import tensorflow as tf
from sklearn.model_selection import train_test_split

def train_test_split_tensors(X, y, **options):
    """
    Wrapper around sklearn.model_selection.train_test_split that accepts
    tensors as input and returns tensors as output.

    :param X: tensorflow.Tensor object
    :param y: tensorflow.Tensor object
    :param options: typical sklearn options, such as test_size and train_size
    """
    # .numpy() is only available on eager tensors (the default in TensorFlow 2.x)
    X_train, X_test, y_train, y_test = train_test_split(X.numpy(), y.numpy(), **options)

    X_train, X_test = tf.constant(X_train), tf.constant(X_test)
    y_train, y_test = tf.constant(y_train), tf.constant(y_test)

    return X_train, X_test, y_train, y_test
The Doctor
  • 1
    The `.numpy()` after my encoded tensor worked for me! I then just passed the output from sklearn to keras sequential model just fine. Thank you! – Andre Jul 26 '22 at 17:50