I have read the images from my local directory as follows:

from PIL import Image
import os

root = '/Users/xyz/Desktop/data'

for path, subdirs, files in os.walk(root):
    for name in files:
        img_path = os.path.join(path,name)

I have two subdirectories: category-1 and category-2, each of which contains image files (.jpg) that belong to each category.

How can I use these images and the two categories with the train_test_split() function in scikit-learn to set up the training and testing data?

Thanks.

Simplicity

1 Answer

You have to read the pixel data from the images and store it in a Pandas DataFrame or a NumPy array. At the same time, you have to store the corresponding category values, category-1 (1) and category-2 (2), in a list or NumPy array. Here is a rough sketch; I am going to assume that you have some mapping, categories, that returns 1 or 2 for a given image path.

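import os
import numpy
from PIL import Image

X = []
y = []

for path, subdirs, files in os.walk(root):
  for name in files:
    img_path = os.path.join(path, name)
    correct_cat = categories[img_path]  # assumed mapping: image path -> 1 or 2
    # Flatten the pixels into one row; all images must share the same size and mode
    img_pixels = numpy.asarray(Image.open(img_path)).ravel()
    X.append(img_pixels)
    y.append(correct_cat)

# Stack the rows once at the end: shape (n_images, n_pixel_values)
X = numpy.array(X)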

You are effectively storing image pixels and category values (converted to integers). There could be alternative ways of doing that: Check this for example.
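If you do not already have a categories mapping, one option is to derive the label from the name of the folder each image sits in. The following is only a sketch under some assumptions of mine: the helper load_dataset, the LABELS dictionary, and the 64x64 target size are hypothetical names and values, not something from your code. Resizing and converting to RGB is there so that every image produces a row of the same length.

```python
import os

import numpy as np
from PIL import Image

# Assumed mapping from subdirectory name to integer label.
LABELS = {"category-1": 1, "category-2": 2}


def load_dataset(root, size=(64, 64)):
    """Return (X, y): flattened pixel rows and their integer labels."""
    rows, labels = [], []
    for path, subdirs, files in os.walk(root):
        label = LABELS.get(os.path.basename(path))
        if label is None:
            continue  # skip the root itself and any unrelated folders
        for name in sorted(files):
            if not name.lower().endswith(".jpg"):
                continue  # ignore non-image files such as .DS_Store
            with Image.open(os.path.join(path, name)) as img:
                # Force a common mode and size so every row has equal length.
                pixels = np.asarray(img.convert("RGB").resize(size))
            rows.append(pixels.ravel())
            labels.append(label)
    return np.array(rows), np.array(labels)
```

You would then call it as X, y = load_dataset(root) and pass both arrays on to train_test_split.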

Once you have X and y, you can call train_test_split on them:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
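With only two categories, it may also help to control the split a bit more. Here is a small sketch using synthetic arrays in place of your images (the data and the parameter values are my own assumptions, chosen only for illustration). test_size, stratify, and random_state are standard train_test_split parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10 "images" of 12 pixel values each.
X = np.arange(120).reshape(10, 12)
y = np.array([1] * 5 + [2] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the images for testing
    stratify=y,        # keep the category-1/category-2 ratio in both splits
    random_state=0,    # make the shuffle reproducible
)

print(X_train.shape, X_test.shape)
```

Stratifying on y is worth considering when one category has far fewer images than the other, since a plain random split could leave the test set with very few examples of it.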
Community
Sudeep Juvekar