How to split data into training and testing sets

Question

I have a dataset composed of images and their labels, which I load with a custom function I wrote. The problem is that I don't know how to split the data properly into training and testing sets. I looked through the TensorFlow documentation and found some material, but it isn't explanatory enough.

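import os
import cv2

def create_training_data():
    """Load every image under DATASET/<category> as grayscale, labelled by class index."""
    _images = []
    _labels = []
    # The position of each category in CATEGORIES is used as its numeric label.
    for class_num, category in enumerate(CATEGORIES):
        new_path = os.path.join(DATASET, category)
        for img in os.listdir(new_path):
            img_array = cv2.imread(os.path.join(new_path, img), cv2.IMREAD_GRAYSCALE)
            if img_array is None:  # cv2.imread returns None for files it cannot read
                continue
            _images.append(img_array)
            _labels.append(class_num)
    return (_images, _labels)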

This is how I load the data at the moment:

import numpy as np

(training_images, training_labels) = create_training_data()
training_images = np.array(training_images)
training_images = training_images / 255.0  # scale pixel values to [0, 1]

How can I split this into training and testing sets with a test size of 0.3?

Does this answer your question? [Stratified Train/Test-split in scikit-learn](https://stackoverflow.com/questions/29438265/stratified-train-test-split-in-scikit-learn) — Hailin FU, Feb 25 '20 at 06:11

score 1 · Accepted Answer · answered Feb 25 '20 at 04:02

Try sklearn.model_selection.train_test_split. It splits the samples and their labels together, and it accepts plain lists as well as arrays, so it works with your data:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    training_images, training_labels, test_size=0.3, random_state=42)
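For image classification it is often useful to keep the class proportions the same in both splits, which is what the linked stratified-split question is about. The following is a minimal sketch of that idea, not the asker's actual pipeline: the synthetic `images` and `labels` arrays are stand-ins invented for illustration, and passing `stratify=labels` is an optional addition to the call above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loaded dataset (an assumption, not the real data):
# 100 grayscale 8x8 "images" in 2 classes with 50 samples each.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 8, 8)).astype("float32") / 255.0
labels = np.array([0] * 50 + [1] * 50)

# test_size=0.3 holds out 30% of the samples for testing.
# stratify=labels keeps the class ratio the same in both splits.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    images,
    labels,
    test_size=0.3,
    random_state=42,
    stratify=labels,
)

print(X_train.shape, X_test.shape)                # sizes of the two splits
print(np.bincount(y_train), np.bincount(y_test))  # per-class counts in each split
```

Without `stratify`, the split is a plain random shuffle, so a small or imbalanced dataset may end up with a different class ratio in the test set than in the training set.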

Hailin FU
