Using train_test_split with images from my local directory

Question

I have read the images from my local directory as follows:

from PIL import Image
import os

root = '/Users/xyz/Desktop/data'

for path, subdirs, files in os.walk(root):
    for name in files:
        img_path = os.path.join(path,name)

I have two subdirectories: category-1 and category-2, each of which contains image files (.jpg) that belong to each category.

How can I use those images and two categories with the train_test_split() function in Scikit-Learn? In other words, to arrange the training and testing data?

Thanks.

score 3 · Answer 1 · edited May 23 '17 at 10:28

You have to read the pixel data from each image and store it in a Pandas DataFrame or a NumPy array. At the same time, you have to store the corresponding category values, category-1 (1) and category-2 (2), in a list or NumPy array. Here is a rough sketch. I am going to assume that you have some lookup called categories that returns 1 or 2 for a given image path.

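import os
import numpy
from PIL import Image

X = []  # one flattened pixel vector per image
y = []  # the matching category label (1 or 2)

for path, subdirs, files in os.walk(root):
    for name in files:
        img_path = os.path.join(path, name)
        correct_cat = categories[img_path]  # assumed lookup: image path -> 1 or 2
        # Flatten the pixels to 1-D; all images must share the same size and mode
        img_pixels = numpy.asarray(Image.open(img_path)).ravel()
        X.append(img_pixels)
        y.append(correct_cat)

# Stack the per-image vectors into a 2-D array of shape (n_images, n_pixel_values)
X = numpy.array(X)
y = numpy.array(y)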

You are effectively storing the image pixels and the category values (converted to integers). There are alternative ways of doing that.
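The categories lookup assumed above is not defined anywhere in the answer. One possible way to build it, sketched here under the assumption that each image sits directly inside a subdirectory named category-1 or category-2, is to take the label from the parent directory name. The names LABELS, label_from_path and build_categories are hypothetical helpers introduced only for this illustration.

```python
import os

# Hypothetical mapping from subdirectory name to integer label.
LABELS = {'category-1': 1, 'category-2': 2}


def label_from_path(img_path):
    """Return the label of an image based on its parent directory name.

    Raises KeyError if the parent directory is not listed in LABELS.
    """
    parent = os.path.basename(os.path.dirname(img_path))
    return LABELS[parent]


def build_categories(root):
    """Walk root and map every .jpg path to its integer label."""
    categories = {}
    for path, subdirs, files in os.walk(root):
        for name in files:
            # Skip anything that is not a .jpg (for example hidden system files)
            if name.lower().endswith('.jpg'):
                img_path = os.path.join(path, name)
                categories[img_path] = label_from_path(img_path)
    return categories
```

Calling build_categories(root) once before the loop would give a dictionary that can be indexed as categories[img_path].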

Once you have X and y, you can call train_test_split on them:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
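With only two categories it can help to control the split explicitly. The sketch below uses synthetic arrays in place of real image data (an assumption made so it runs without any files) and shows the test_size, stratify and random_state parameters of train_test_split. The values 0.2 and 0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the image data: 10 "images" of 12 pixel values each.
X = np.arange(120).reshape(10, 12)
# Five samples labelled 1 and five labelled 2.
y = np.array([1] * 5 + [2] * 5)

# Hold out 20% for testing, keep the class ratio equal in both parts,
# and fix the seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)  # (8, 12) (2, 12)
```

Passing stratify=y keeps the proportion of category-1 and category-2 samples the same in the training and test sets, which matters when one category has far fewer images than the other.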

Nice. X can be a list. use np.array(X) to have a numpy array. — Nando, Aug 01 '21 at 16:15
