
I'm having trouble loading a lot of small image files (approx. 90k PNG images) into a single 3D np.array. The current solution takes a couple of hours, which is unacceptable.

The images are size 64x128.

I have a pd.DataFrame called labels with the names of the images and want to import those images in the same order as in the labels variable.

My current solution is:

dataset = np.empty([1, 64, 128], dtype=np.int32)

for file_name in labels['file_name']:
    array = cv.imread(f'{IMAGES_PATH}/{file_name}.png', cv.COLOR_BGR2GRAY)
    dataset = np.append(dataset, [array[:]], axis=0)

From what I have timed, the most time-consuming operation is dataset = np.append(dataset, [array[:]], axis=0), which takes around 0.4s per image.

Is there any better way to import such files and store them in a np.array?

I was thinking about multiprocessing, but I want the labels and dataset to be in the same order.

  • How about importing them into a numpy array and saving the array? Also, you can have ordered results with multiprocessing or concurrent.futures; however, the runtime here is probably bound by disk speed. – Dimitry Jul 01 '21 at 21:56
  • If you know you have 90,000 images you can surely declare your array with size `[90000, 64, 128]` up-front rather than appending and reallocating 90,000 times? – Mark Setchell Jul 01 '21 at 22:12
  • Load them into an array (or a group of arrays) and save them in binary (using .npy, or .npz for the case of multiple arrays), then load that. If they cannot all fit into your memory at once, use memmap in numpy to access pieces of them at a time. – Ehsan Jul 01 '21 at 23:15
  • list append is faster. Make the big array from the list at the end. – hpaulj Jul 02 '21 at 01:10
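
A minimal sketch of the list-then-stack approach from the comments above, with ordered parallel reads via concurrent.futures (executor.map returns results in input order, so labels and dataset stay aligned); IMAGES_PATH and labels are assumed to be defined as in the question:

from concurrent.futures import ThreadPoolExecutor

import cv2 as cv
import numpy as np

def read_one(file_name):
    # read directly as a single-channel grayscale image
    return cv.imread(f'{IMAGES_PATH}/{file_name}.png', cv.IMREAD_GRAYSCALE)

# threads release the GIL during file I/O, so reads can overlap
with ThreadPoolExecutor() as pool:
    images = list(pool.map(read_one, labels['file_name']))

# build the big array once at the end instead of appending in a loop
dataset = np.stack(images).astype(np.int32)  # shape (N, 64, 128)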

1 Answer


Game developers typically concatenate bunches of small images into a single big file and then use sizes and offsets to slice out the currently needed piece. Here's an example of how this can be done with ImageMagick:

montage -mode concatenate -tile 1x *.png out.png
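
Assuming -tile 1x stacks the 64x128 tiles vertically in input order, image i can later be sliced back out of the sheet by row offset; a small sketch:

import cv2 as cv

sheet = cv.imread('out.png', cv.IMREAD_GRAYSCALE)
i = 42                                # index of the wanted image
tile = sheet[i * 64:(i + 1) * 64, :]  # rows i*64 .. (i+1)*64-1, all columns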

But then again, this does not get around reading 90k small files in the first place, and magick has its own peculiarities which may or may not surface in your case.

Also, I hadn't originally noticed that you are having a problem with np.append(dataset, [array[:]], axis=0). That is a very bad line: np.append copies the entire dataset on every call, so appending in a loop is never performant code.

Either preallocate the array and write into it, or use numpy's functions for concatenating many arrays at once:

dataset = np.empty([len(labels), 64, 128], dtype=np.int32)  # allocate once, up-front
for i, file_name in enumerate(labels['file_name']):
    # cv.IMREAD_GRAYSCALE is the proper imread flag; the question's
    # cv.COLOR_BGR2GRAY is a cvtColor code, not an imread flag
    array = cv.imread(f'{IMAGES_PATH}/{file_name}.png', cv.IMREAD_GRAYSCALE)
    dataset[i, :, :] = array
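
Once the array is built, it's worth saving it in binary form, as suggested in the comments, so later runs skip the 90k small reads entirely; a minimal sketch (the filename dataset.npy is just an example):

np.save('dataset.npy', dataset)                  # one-time binary dump

# on later runs: load it back, or memory-map it if it doesn't fit in RAM
dataset = np.load('dataset.npy')
dataset = np.load('dataset.npy', mmap_mode='r')  # lazy, read-only view
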
Dimitry