
I have a very large array of images (multiple GB) and want to split it using NumPy. This is my code:

images = ... # this is the very large array which contains a lot of images.
images.shape => (50000, 256, 256)

indices = ... # array containing ranges, that group the images array like [(0, 300), (301, 580), (581, 860), ...]

train_indices, test_indices = ... # both arrays contain indices like [1, 6, 8, 19] which determine which groups are in the train and which are in the test group

images_train, images_test = np.empty([0, images.shape[1], images.shape[2]]), np.empty([0, images.shape[1], images.shape[2]])

# assign the image groups to either train or test set
for (i, rng) in enumerate(indices):
    group_range = range(rng[0], rng[1]+1)
    if i in train_indices:
        images_train = np.concatenate((images_train, images[group_range]))
    else:
        images_test = np.concatenate((images_test, images[group_range]))

The problem with this code is that images_train and images_test are new arrays, and every image is copied into them. This roughly doubles the memory needed to run the program.

Is there a way to split my images array into images_train and images_test without having to copy the images, but rather reuse them?

My intention with the indices is to group the images into roughly 150 groups, where the images from one group should end up either all in the train set or all in the test set.

desertnaut
Codey
  • Would those ranges represent slices of equal lengths? – Divakar Jun 10 '20 at 14:14
  • No, the ranges vary. So `indices` is more like `[(0, 300), (301, 578), (579, 850), ...]` – Codey Jun 10 '20 at 14:16
  • Individual slices will be `views`, but concatenating them creates a copy. – hpaulj Jun 10 '20 at 14:36
  • Yes I've read that from here https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html. But is there an approach that could help me? – Codey Jun 10 '20 at 15:03
  • Since you are trying to concatenate rather than copy, the question has already been asked here: https://stackoverflow.com/questions/7869095/concatenate-numpy-arrays-without-copying – SeF Jun 10 '20 at 17:31
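hpaulj's point in the comments (individual slices are views, but concatenating them copies) can be checked directly with np.shares_memory:

```python
import numpy as np

A = np.arange(20)
B = A[5:16:2]               # basic slice -> a view into A
C = np.concatenate((B, B))  # concatenation -> a new array (copy)

print(np.shares_memory(A, B))  # True: a slice shares memory with its parent
print(np.shares_memory(A, C))  # False: the concatenated result does not
B[0] = 99                      # writing through the view changes A
print(A[5])                    # 99
```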

1 Answer


Without runnable code it's difficult to understand the details, but I can try to give some ideas. If you have images_train and images_test, then you will probably use them to train and to test with commands something like

.fit(images_train);
.score(images_test)

An approach might be to not build images_train and images_test at all, but to use parts of images directly:

.fit(images[...]);
.score(images[...])

Now the question is: what should go in the [...] brackets? Or is there a NumPy operation that extracts the right images[...]? First we have to think about what we should avoid:

  • a Python-level for loop over a large array is slow
  • iteratively growing an array, as in A = np.concatenate((A, B[j])), copies everything on every step
  • "fancy indexing" always creates a copy, as in group_range = range(rng[0], rng[1]+1); images[group_range]
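Regarding the second point: if one copy is acceptable, the repeated concatenation can at least be reduced to a single copy by collecting the group slices (which are views) in a list first and concatenating once at the end. A minimal sketch with small stand-in data, assuming indices and train_indices shaped as in the question:

```python
import numpy as np

images = np.random.rand(1000, 8, 8)           # small stand-in for the large array
indices = [(0, 300), (301, 580), (581, 860)]  # group ranges, as in the question
train_indices = [0, 2]                        # groups assigned to the train set

# Slicing creates views (no copy); the single concatenate at the end
# performs one copy per result array instead of one copy per group.
train_parts = [images[lo:hi + 1] for i, (lo, hi) in enumerate(indices)
               if i in train_indices]
test_parts = [images[lo:hi + 1] for i, (lo, hi) in enumerate(indices)
              if i not in train_indices]

images_train = np.concatenate(train_parts)
images_test = np.concatenate(test_parts)
```

This still copies the data once into the result arrays, but it avoids the quadratic cost of growing the array inside the loop.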

Some ideas:

  • use slices instead of "fancy indexing", see here
  • images[rng[0] : rng[1]+1], or
  • group_range = slice(rng[0], rng[1]+1); images[group_range]

  • Is images_train = images[train_indices, :, :] and images_test = images[test_indices, :, :] ?

  • images.shape => (50000, 256, 256), i.e. images is 3-dimensional?
  • try whether numpy.where can give some assistance
  • below are the methods I've mentioned

...

import numpy as np

A = np.arange(20); print("A =", A)
B = A[5:16:2]; print("B =", B)                  # view of A only, faster
j = slice(5, 16, 2); C = A[j]; print("C =", C)  # view of A only, faster
k = [2, 4, 8, 12]; D = A[k]; print("D =", D)    # fancy indexing, generates a copy

A = [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
B = [ 5  7  9 11 13 15]
C = [ 5  7  9 11 13 15]
D = [ 2  4  8 12]
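The .fit(images[...]) idea can be pushed further if the estimator supports incremental training (scikit-learn's partial_fit, for example). The Model class below is just a hypothetical stand-in; the point is that each group is passed as a slice, i.e. a view, so a combined images_train array is never materialized:

```python
import numpy as np

class Model:
    """Hypothetical stand-in for an estimator with partial_fit-style training."""
    def __init__(self):
        self.n_seen = 0
    def partial_fit(self, X):
        self.n_seen += len(X)  # a real model would update its weights here

images = np.random.rand(1000, 8, 8)           # small stand-in for the big array
indices = [(0, 300), (301, 580), (581, 860)]  # group ranges as in the question
train_indices = [0, 2]                        # groups assigned to the train set

model = Model()
for i, (lo, hi) in enumerate(indices):
    if i in train_indices:
        group = images[lo:hi + 1]             # slice -> view, no copy
        assert np.shares_memory(images, group)
        model.partial_fit(group)

print(model.n_seen)  # 301 + 280 = 581 training images
```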
pyano