
I have a large dataset of around 10,000 images imported from Google Drive, and I want to turn them into a NumPy array so I can train my machine learning model. The problem is that my current approach takes too long and consumes a lot of RAM.

from PIL import Image
import glob
import numpy as np

train_images = glob.glob('/content/drive/MyDrive/AICW/trainy/train/*.jpg')

x_train = np.array([np.array(Image.open(image)) for image in train_images])

These lines of code were still running after 30 minutes, and even when I did manage to get a NumPy array, the images have different sizes (e.g. some are 450 × 600 and others are 500 × 600), which is going to be problematic when I feed them into my model. There must be a way that's more time- and space-efficient, right?

P.S. I'm running all of this on Google Colab. The total number of images is 10,270. Size varies from image to image, but they are all roughly 450 by 600 by 3.

    Resize the images to be much smaller and train in batches? – Chris Mar 26 '21 at 15:19
  • Look in `Image` for a resizing method. Do that before trying to combine them into one array. What's your machine learning model? From some import like `keras`, or your own `numpy`? It's hard to tell from your description whether the slowness is due to the sheer number of images, or if you are hitting a memory management limit. – hpaulj Mar 26 '21 at 15:21
  • `450*600*3*10270/1e9` is roughly 8 billion elements. Multiply that by 1, 4, or 8 bytes depending on the `dtype`. – hpaulj Mar 26 '21 at 15:25
  • There's not much advantage to converting the last line's list into a numpy array, and you'll have more flexibility with a list (eg, for memory management). – tom10 Mar 26 '21 at 15:31
  • You can load them in parallel, although this is cumbersome in Python (you need to use multiprocessing, which is not well suited to the computation that follows)... – Jérôme Richard Mar 26 '21 at 16:32
  • It looks like you could use a generator to deal with the memory issue, it will come with a speed penalty, but you can probably load the next batch while training your model on the current one. Assuming you're using TensorFlow/Keras, I would suggest you read [tf.data](https://www.tensorflow.org/guide/data), and then [how to optimize it](https://www.tensorflow.org/guide/data_performance) – Mateo Torres Mar 26 '21 at 19:13
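
Building on the `tf.data` suggestion in the last comment, a minimal sketch of such an input pipeline might look like the following (the glob pattern and 128×128 target size are assumptions; the point is that decoding, resizing, and batching happen lazily and in parallel instead of materializing one giant array in RAM):

import tensorflow as tf

# Assumed path pattern; adjust to your own directory layout.
files = tf.data.Dataset.list_files('/content/drive/MyDrive/AICW/trainy/train/*.jpg')

def load_and_resize(path):
    # Read, decode, and resize a single JPEG to a fixed shape.
    image = tf.io.read_file(path)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [128, 128])
    return image

dataset = (files
           .map(load_and_resize, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

# dataset can be passed to model.fit(...) without ever holding all
# 10,270 images in memory at once.

Labels would typically be added by pairing each file path with its class (for example by parsing the path inside the map function), which the linked guides cover.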

1 Answer


Lots of good suggestions in the comments (most importantly about the total size of x_train if you don't resize the images). As noted, if you want to keep arrays of different sizes, simply make x_train a list (instead of a np.array). Eventually you will probably need to resize for training and testing. The Pillow docs show image-to-NumPy conversion with np.asarray(); not sure if that matters.
I modified your code slightly to: 1) create x_train as an empty array of dtype=object (to hold the image arrays), 2) resize the images, and 3) use np.asarray() to convert the images. It reads 26,640 images into an array in a few seconds on a desktop system with 24 GB RAM.
Code below:

train_images = glob.glob('*/*.jpg', recursive=True)
x_train = np.empty(shape=(len(train_images),), dtype=object)
size = 128, 128

for i, image in enumerate(train_images):
    img = Image.open(image)
    img.thumbnail(size)            # thumbnail() resizes in place and returns None
    x_train[i] = np.asarray(img)
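
Note that thumbnail() preserves aspect ratio, so the resized images can still have slightly different shapes. If the model eventually needs one uniform array, a possible follow-up (not part of the original answer; the 128×128 target is an assumption) is to force an exact size with Image.resize and stack the results:

uniform = np.stack([
    np.asarray(Image.open(image).resize((128, 128)))   # exact size, may distort aspect ratio
    for image in train_images
])
# uniform.shape is (len(train_images), 128, 128, 3) for RGB JPEGs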