
I have a problem with memory when converting a big list of 2D elements into a 3D numpy array. I'm using the Colab environment. I'm working on a deep learning project with medical images (.nii) and a CNN network. The images are float type (because of standardization). I load the images (one channel) into memory as a list, then divide each one into small pieces (11x11 resolution). As a result I have a list of 11650348 11x11 images.

Get sequences. Memory info:

Gen RAM Free: 12.8 GB | Proc size: 733.4 MB

GPU RAM Free: 15079MB | Used: 0MB | Util 0% | Total 15079MB

get seqences...

Time: 109.60107789899996

Gen RAM Free: 11.4 GB | Proc size: 2.8 GB

GPU RAM Free: 15079MB | Used: 0MB | Util 0% | Total 15079MB

[INFO] data matrix in list of 11507902 images

Now I'm using the np.array method to convert the list into an array.

Memory info:

Gen RAM Free: 11.8 GB | Proc size: 2.1 GB

GPU RAM Free: 15079MB | Used: 0MB | Util 0% | Total 15079MB

Coverting....

Gen RAM Free: 6.7 GB | Proc size: 7.3 GB

GPU RAM Free: 15079MB | Used: 0MB | Util 0% | Total 15079MB

Shape of our training data: (11650348, 11, 11, 1) SPLIT! See code below.

As you can see, I've lost a lot of memory. Why does this happen?

I've tried np.asarray and np.array with the copy parameter. It didn't work.

Code responsible for dividing the original image:

def get_parts(image, segmented):
    # Cut an 11x11 window around every non-zero voxel of the volume
    # and collect the matching segmentation label for its centre.
    T2 = image[0]
    seg = segmented[0]
    labels = []
    val = []
    window_width = 5
    zlen, ylen, xlen = T2.shape
    for x in range(0, xlen):
        for y in range(0, ylen):
            for z in range(0, zlen):
                if T2[z, y, x] != 0:
                    xbegin = x - window_width
                    xend = x + window_width + 1
                    ybegin = y - window_width
                    yend = y + window_width + 1
                    val.append(T2[z, ybegin:yend, xbegin:xend])   # a view, not a copy
                    labels.append(seg[z, y, x])
    return val, labels

Getting the values:

uber_dane = []     # all 11x11 pieces, across every volume
uber_label = []
for x in range(0, length):    # length = number of loaded volumes
    data, labels = get_parts(T2_images[x], segmented[x])
    uber_dane.extend(data)
    uber_label.extend(labels)

I'm transforming it this way:

X_train, X_test, y_train, y_test = train_test_split(uber_dane, uber_label, test_size=0.2, random_state=0)

#LABELS
y_train = np.array(y_train)
y_test = np.array(y_test)
y_train = np.expand_dims(y_train, axis=-1)   # add a trailing axis before one-hot encoding
y_test = np.expand_dims(y_test, axis=-1)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

#DATA - HERE IS THE PROBLEM
X_train = np.array(X_train)                  # every 11x11 view is copied into one big array here
X_test = np.array(X_test)
print(sys.getsizeof(X_train))
print(sys.getsizeof(X_test))
X_train = np.expand_dims(X_train, axis=-1)   # (N, 11, 11) -> (N, 11, 11, 1)
X_test = np.expand_dims(X_test, axis=-1)

What do you think about it? Maybe I'm doing something wrong. An array should take less memory than a list :/ I did some searching on Stack Overflow and the Internet, but it did not help and I couldn't solve it on my own.

I hope you will have some good ideas :D

UPDATE 08-06-2019

I've run my code in PyCharm and got a different error:

X_train = np.array(uber_dane)
ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size.

I've got: Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 17:26:49) [MSC v.1900 32 bit (Intel)] on win32. So a 32-bit Python is trying to allocate more than 3 GB, which a 32-bit process cannot do.
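
For reference, a quick way to check both the interpreter's bitness and the size the full array would need (assuming float32 patches; the patch count is taken from the shape printed above):

import struct
import numpy as np

print(struct.calcsize("P") * 8)          # 32 -> 32-bit interpreter, 64 -> 64-bit
n_patches = 11650348                     # from "Shape of our training data" above
bytes_needed = n_patches * 11 * 11 * np.dtype(np.float32).itemsize
print(bytes_needed / 1024**3)            # ~5.25 GiB for the training array alone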

Related: lmfit minimize fails with ValueError: array is too big

What do you think?

– Damian
  • You can try `numpy.memmap` to avoid loading the array into memory (https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html); a sketch of this is included after these comments. – jpkotta Jun 06 '19 at 20:46
  • I will try it tomorrow :) I need to feed the whole dataset into the network, I guess. I had a problem with HDF5 too; it took too long to save the data. – Damian Jun 06 '19 at 20:58
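
A rough sketch of the memmap idea from the first comment; the filename, dtype and fill loop are illustrative placeholders, not code from the question:

import numpy as np

n_patches = 11650348                     # from the shape printed in the question
X = np.memmap('patches.dat', dtype=np.float32, mode='w+',
              shape=(n_patches, 11, 11, 1))

# instead of appending to a Python list, write each patch straight to disk:
# X[i] = patch[..., np.newaxis]
X.flush()

# later (e.g. for training) reopen it read-only; nothing is loaded up front:
X = np.memmap('patches.dat', dtype=np.float32, mode='r',
              shape=(n_patches, 11, 11, 1))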

2 Answers


Are you planning to use fit, evaluate or predict? If so, you can try to load only some data with a custom generator and use fit_generator (evaluate_generator, ...)
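
A minimal sketch of such a generator, assuming Keras is available; the class name PatchSequence, the batch size and the number of classes are placeholders, and `uber_dane` / `uber_label` are the lists from the question:

import numpy as np
from keras.utils import Sequence, to_categorical

class PatchSequence(Sequence):
    """Serves (batch of 11x11 patches, one-hot labels) without ever
    materialising all patches in RAM at once."""
    def __init__(self, patches, labels, batch_size=128, n_classes=2):
        # `patches` can stay the existing list of numpy views;
        # only one batch is copied into a real array at a time.
        self.patches, self.labels = patches, labels
        self.batch_size, self.n_classes = batch_size, n_classes

    def __len__(self):
        return int(np.ceil(len(self.patches) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        x = np.asarray(self.patches[sl], dtype=np.float32)[..., np.newaxis]  # (B, 11, 11, 1)
        y = to_categorical(self.labels[sl], num_classes=self.n_classes)
        return x, y

# usage sketch:
# model.fit_generator(PatchSequence(uber_dane, uber_label), epochs=10)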

– dcolazin
  • Yup, I'm planning to use them. I'm not sure a generator will help me, though. I have to build an algorithm for brain tumor segmentation, and every brain tumor is different in shape, size and position. – Damian Jun 06 '19 at 20:54
  • A generator does not "generate" more data, it simply loads the data gradually (i.e. instead of loading 11650348 images, you can easily load 10000 images into memory at a time and the generator will load the next 10000 when requested by `fit`) – dcolazin Jun 06 '19 at 22:29
  • You are right! I'm really sorry, my bad. I was thinking of ImageDataGenerator. I will try it! – Damian Jun 08 '19 at 14:31
  • But what about the gradient in the CNN? As far as I know, training on 10000 images at a time is different from training on 100000. – Damian Jun 08 '19 at 15:43
  • @Damian Isn't the gradient calculated at every batch step? If the batch size remains the same there will be no difference. – dcolazin Jun 09 '19 at 03:30
  • Could you help me with that? I cannot load only 10000 images out of 1 000 000 per batch. I guess the biggest problem is the data shuffle. First I need to load the main image and then divide it into small pieces. If I load even 5 images, I get 6-7 million small images. The scenario looks like: 1. Load the 1st image 2. Divide it 3. Take 10000 pieces 4. When all sub-images have been used, load another image? – Damian Jun 28 '19 at 19:53
  • @Damian you should open another question – dcolazin Jun 28 '19 at 21:28

When you create the list of small pieces, you don't actually create a list of numpy arrays but a list of numpy views (see the first note in this section). These objects don't store the data themselves; they only refer to the large source array (in your case T2_images[i]).

You can observe it in this example (when you modify an element of the slice, you actually modify the source array):

x = np.arange(5)
y = x[:2]      # y is a view into x, not a copy
y[0] = 3
print(x)       # [3 1 2 3 4]

When you convert this list to a three-dimensional numpy array, all of the required data has to be copied out of the large source arrays into the new array, which then owns it all at once.
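
A small self-contained demonstration of the difference (the shapes here are illustrative, not the questioner's data): the views in the list own no memory of their own, while np.array forces a full copy.

import numpy as np

big = np.zeros((155, 240, 240), dtype=np.float32)        # one full volume, ~34 MB
views = [big[0, y:y + 11, x:x + 11]                      # 400 patches, all just views
         for y in range(0, 220, 11) for x in range(0, 220, 11)]

print(all(v.base is big for v in views))   # True: no patch owns its own data
stacked = np.array(views)                  # here every patch really gets copied
print(stacked.shape, stacked.nbytes)       # (400, 11, 11), 193600 bytes of new memory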

– Tomerikoo
  • So there is no solution? Is this the only way to change a list into an array? – Damian Jul 31 '19 at 18:23
  • Do you really need a (11650348, 11, 11) numpy array? Assuming you are using float32, you need at least 11650348*11*11*4 bytes, which is around 5.2 GB (consistent with your observations). The only thing you could change is the dtype: float16 would halve the required memory. – Michał Sadowski Aug 01 '19 at 22:55
  • But do you want to use it as one batch to train the CNN? Neural networks need a lot of memory for intermediate results, so it surely wouldn't fit into memory. Furthermore, I don't think it's a good idea to use all 11M pieces to train the NN. If I were you I would draw some locations in the large image and use them as a mini-batch (choose a batch size that lets the network fit in memory during training), then move to another image or draw another batch of image pieces; a rough sketch follows these comments. – Michał Sadowski Aug 01 '19 at 22:58
  • I don't understand what you mean. Can we talk about it through private messages? – Damian Aug 02 '19 at 11:17
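
A rough sketch of the sampling idea from the last comments; the function name, batch size and window default are made up, and T2/seg are a single volume and its segmentation, as in get_parts:

import numpy as np

def sample_patch_batch(T2, seg, batch_size=128, window=5):
    """Draw random non-zero voxel locations from one volume and cut
    (2*window+1)x(2*window+1) patches around them for a single mini-batch."""
    z, y, x = np.nonzero(T2)
    # keep only centres whose full window fits inside the slice
    ok = ((y >= window) & (y < T2.shape[1] - window) &
          (x >= window) & (x < T2.shape[2] - window))
    z, y, x = z[ok], y[ok], x[ok]
    idx = np.random.choice(len(z), size=min(batch_size, len(z)), replace=False)
    patches = np.stack([T2[zi, yi - window:yi + window + 1, xi - window:xi + window + 1]
                        for zi, yi, xi in zip(z[idx], y[idx], x[idx])])
    labels = seg[z[idx], y[idx], x[idx]]
    return patches[..., np.newaxis].astype(np.float32), labels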