I am reposting my original question because, even after significant improvements to its clarity, it was not revived by the community.

I am looking for a way to split feature data and the corresponding label data into train and test sets using TensorFlow's built-in methods. My data is already in two tensors (i.e. tf.Tensor objects) named features and labels.

I know how to do this easily for NumPy arrays using sklearn.model_selection, as shown in this post. Additionally, I was pointed to this method, which requires the data to be in a single tensor. I also need the train and test sets to be disjoint, unlike in this method (meaning they can't share any data points after the split).

I am looking for a way to do the same using built-in methods in TensorFlow.

There may be too many conditions in my requirement, but basically what I need is a TensorFlow equivalent of sklearn.model_selection.train_test_split(), along the lines of the following:

import tensorflow as tf

# Desired (hypothetical) API: tf.train_test_split does not exist in TensorFlow
X_train, X_test, y_train, y_test = tf.train_test_split(features,
                                                       labels,
                                                       test_size=0.1,
                                                       random_state=123)
Hawklaz

1 Answer

You can achieve this in TensorFlow as follows:

from typing import Tuple

import tensorflow as tf


def split_train_test(features: tf.Tensor,
                     labels: tf.Tensor,
                     test_size: float,
                     random_state: int = 1729) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor]:

    # Generate random masks
    random = tf.random.uniform(shape=(tf.shape(features)[0],), seed=random_state)
    train_mask = random >= test_size
    test_mask = random < test_size

    # Gather the rows selected by each mask
    train_features = tf.boolean_mask(features, mask=train_mask)
    train_labels = tf.boolean_mask(labels, mask=train_mask)
    test_features = tf.boolean_mask(features, mask=test_mask)
    test_labels = tf.boolean_mask(labels, mask=test_mask)

    return train_features, test_features, train_labels, test_labels

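Here we first create a random uniform tensor with one value per data point. We then build two boolean masks from the ratio given by test_size, and finally extract the train and test parts using tf.boolean_mask. Because the two masks are complementary, every data point lands in exactly one set, so the sets are disjoint. Note that test_size is only the expected fraction here: each point is assigned independently, so the actual split sizes vary.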
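If you need a test set of an exact size, as sklearn's train_test_split gives you, one possible variation is to draw a random permutation of the row indices and slice it. The sketch below is only an illustration under a few assumptions: the helper name split_train_test_exact is made up for this example, the test size is rounded to the nearest integer (sklearn's own rounding may differ), and tf.random.stateless_uniform is swapped in for tf.random.uniform so that the same random_state should produce the same split on every call.

```python
import tensorflow as tf


def split_train_test_exact(features, labels, test_size, random_state=1729):
    """Hypothetical helper: exact-size, disjoint, reproducible split."""
    n = tf.shape(features)[0]
    # One random key per row; a stateless op returns the same keys for the same seed.
    keys = tf.random.stateless_uniform(shape=tf.shape(features)[:1],
                                       seed=[random_state, 0])
    # Sorting the keys gives a random permutation of the row indices.
    perm = tf.argsort(keys)
    # Number of test rows, rounded to the nearest integer (an assumption of this sketch).
    n_test = tf.cast(tf.round(test_size * tf.cast(n, tf.float32)), tf.int32)
    # The two index slices never overlap, so the sets are disjoint.
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return (tf.gather(features, train_idx), tf.gather(features, test_idx),
            tf.gather(labels, train_idx), tf.gather(labels, test_idx))


# Toy data: 20 rows, where the first feature column equals 2 * label.
features = tf.reshape(tf.range(40), (20, 2))
labels = tf.range(20)
X_train, X_test, y_train, y_test = split_train_test_exact(
    features, labels, test_size=0.25)
```

The return order matches the function above (train features, test features, train labels, test labels), so it can be dropped in where the mask-based version is used.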

bluesummers