How to perform an sklearn-style train-test split on feature and label tensors using built-in TensorFlow methods?

Question

Reposting my original question since even after significant improvements to clarity, it was not revived by the community.

I am looking for a way to split feature data and the corresponding label data into train and test sets using TensorFlow's built-in methods. My data is already in two tensors (i.e. tf.Tensor objects), named features and labels.

I know how to do this easily for numpy arrays using sklearn.model_selection, as shown in this post. Additionally, I was pointed to this method, which requires the data to be in a single tensor. I also need the train and test sets to be disjoint, unlike in this method (meaning they can't have common data points after the split).

I am looking for a way to do the same using built-in methods in TensorFlow.

This may be too many requirements, but basically what I need is a TensorFlow equivalent of sklearn.model_selection.train_test_split(), along the lines of the following:

import tensorflow as tf

X_train, X_test, y_train, y_test = tf.train_test_split(features,
                                                labels,
                                                test_size=0.1,
                                                random_state=123)

Yes, thank you. I was hoping for a direct function to do it similar to how it's done in sklearn. But this works. — Hawklaz, Sep 20 '20 at 12:00

score 2 · Accepted Answer · answered Sep 19 '20 at 07:10

You can achieve this with TensorFlow in the following way:

from typing import Tuple

import tensorflow as tf


def split_train_test(features: tf.Tensor,
                     labels: tf.Tensor,
                     test_size: float,
                     random_state: int = 1729) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor]:

    # Generate random masks
    random = tf.random.uniform(shape=(tf.shape(features)[0],), seed=random_state)
    train_mask = random >= test_size
    test_mask = random < test_size

    # Gather values
    train_features, train_labels = tf.boolean_mask(features, mask=train_mask), tf.boolean_mask(labels, mask=train_mask)
    test_features, test_labels = tf.boolean_mask(features, mask=test_mask), tf.boolean_mask(labels, mask=test_mask)

    return train_features, test_features, train_labels, test_labels

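What we are doing here is first creating a uniform random tensor with one value per row of the data. We then build two complementary boolean masks from the ratio given by test_size, and finally extract the train and test parts using tf.boolean_mask. Because the masks are complementary, the two sets are disjoint. Because each row is assigned independently, the test set holds roughly, not exactly, a test_size fraction of the rows.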
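If an exact split size and the same split on every call matter, one possible variant is to shuffle the row indices once and slice them. The sketch below is not part of the answer above: the name exact_split is hypothetical, and it assumes eager execution, a TensorFlow 2.x version that provides tf.random.stateless_uniform, and that rounding n * test_size to the nearest integer is an acceptable test size.

```python
import tensorflow as tf


def exact_split(features, labels, test_size, random_state=1729):
    """Hypothetical helper: permute row indices once, then slice them."""
    n = tf.shape(features)[0]
    # Stateless random ops return the same values for the same seed on every call.
    keys = tf.random.stateless_uniform(shape=(n,), seed=[random_state, 0])
    # Sorting random keys yields a random permutation of 0..n-1.
    perm = tf.argsort(keys)
    # Number of test rows, rounded to the nearest integer (an assumption).
    n_test = tf.cast(tf.round(tf.cast(n, tf.float32) * test_size), tf.int32)
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    # The index sets do not overlap, so train and test are disjoint.
    return (tf.gather(features, train_idx), tf.gather(features, test_idx),
            tf.gather(labels, train_idx), tf.gather(labels, test_idx))


# Toy data: row i of features is [2i, 2i + 1], and its label is i.
features = tf.reshape(tf.range(40, dtype=tf.float32), (20, 2))
labels = tf.range(20)
X_train, X_test, y_train, y_test = exact_split(features, labels, test_size=0.1)
print(X_train.shape, X_test.shape)
```

The return order matches sklearn's train_test_split, so it can be unpacked the same way. Features and labels are gathered with the same indices, which keeps each row paired with its label.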
