
I have run into an out-of-memory problem while running a Python script. The kernel trace reads -

[490426.070081] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=python3,pid=18456,uid=1003
[490426.070085] Out of memory: Killed process 18456 (python3) total-vm:82439932kB, anon-rss:63127200kB, file-rss:4kB, shmem-rss:0kB
[490427.453131] oom_reaper: reaped process 18456 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I strongly suspect it is caused by the concatenations I do in the script, which only became a problem when the smaller test-sample script was applied to a larger dataset of 105,000 entries.

So, a bit of an overview of how my script works. I have about 105,000 rows of timestamps and other data.

dataset -
2020-05-24T10:44:37.923792|[0.0, 0.0, -0.246047720313072, 0.0]
2020-05-24T10:44:36.669264|[1.0, 1.0, 0.0, 0.0]
2020-05-24T10:44:37.174584|[1.0, 1.0, 0.0, 0.0]
2020-05-24T10:57:53.345618|[0.0, 0.0, 0.0, 0.0]
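
Each row is a timestamp and a list of floats separated by a pipe. For reference, a minimal sketch of how such a row could be parsed (the file name here is illustrative):

import ast

rows = []
with open('dataset.txt') as f:              # illustrative file name
    for line in f:
        timestamp, values = line.strip().split('|', 1)
        rows.append((timestamp, ast.literal_eval(values)))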

For every timestamp there are 3 images, so N timestamps give N*3 images; for example, 4 timestamps = 12 images. I would like to concatenate the 3 images for each timestamp into one array along axis=2, giving a result of dimension 70x320x9 per timestamp. Going through all the rows this way should produce an end tensor of dimension Nx70x320x9.
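
To make the intended shapes concrete, here is a minimal sketch with dummy arrays (assuming each image is 70x320x3; the actual image loading is omitted):

import numpy as np

# three dummy 70x320x3 images belonging to one timestamp
imgs = [np.zeros((70, 320, 3)) for _ in range(3)]

per_timestamp = np.concatenate(imgs, axis=2)
print(per_timestamp.shape)                     # (70, 320, 9)

# stacking N such arrays gives the final Nx70x320x9 tensor
stacked = np.stack([per_timestamp, per_timestamp], axis=0)
print(stacked.shape)                           # (2, 70, 320, 9)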

I solved that with help from here -- Python - Numpy 3D array - concatenate issues -- by using a dictionary keyed by timestamp and concatenating later.

import numpy as np

collected_images[timepoint].append(image)   # dict of lists: 3 images per timestamp
.
.
.
output = []
for key, val in collected_images.items():
    temp = np.concatenate(val, axis=2)      # 70x320x9 for one timestamp
    output.append(temp[np.newaxis, ...])    # add a leading axis for stacking

output = np.concatenate(output, axis=0)     # Nx70x320x9

However, as you would've guessed, when applied to 105K timestamps (105K * 3 images), the script crashes with OOM. This is where I seek your help.

  1. I'm looking for ideas to solve this bottleneck. What other strategy can I use to accomplish my requirement?
  2. Is it possible to make some modifications to temporarily work around the kernel's OOM-killer behaviour?
Deepak
    Is there a need to hold all those rows in memory? Can you batch process and append output to MD5? – eNc Jun 10 '20 at 14:54
  • I'm sorry, did you mean hd5? As far as I'm aware, MD5 is a hash function. If it is hd5, I'm going to write into a hdf5 file after all these tasks. If it is indeed MD5, can you please guide me how I can do that? – Deepak Jun 10 '20 at 16:19
  • Yes I meant hd5. It seems to me like you may be holding too much data in memory when you don't really need to. Check out what this guy did, maybe it will give some inspiration https://stackoverflow.com/a/5559069/503835 – eNc Jun 10 '20 at 16:23
  • Thanks. Yes, I want to split into batches but I'm running into an error because of some logic mistake I'm making. I will consult that link and wait until others can give me some ideas. – Deepak Jun 10 '20 at 16:40
  • I took your suggestion -- didn't load them all into memory by clearing the list for each iteration. Thanks! – Deepak Jun 11 '20 at 15:52

2 Answers


If you know the size of your dataset, you can generate a file-mapped array of a predefined size:

import numpy as np
n = 105000
a = np.memmap('array.dat', dtype='int16', mode='w+', shape=(n, 70, 320, 9))

You can use a as a numpy array, but it is stored on disk rather than in memory. Change the data type from int16 to whatever is suitable for your data (int8, float32, etc.).

You probably don't want to use slices like a[:, i, :, :] because those will be very slow.
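
For example, a rough sketch of filling a one row per timestamp, writing each image directly into its slot rather than concatenating (the iteration over rows and images here is illustrative):

# 'rows' is a hypothetical iterable of (timestamp, [img0, img1, img2]) pairs,
# where each image has shape 70x320x3
for i, (timepoint, images) in enumerate(rows):
    for j, img in enumerate(images):
        a[i, :, :, j*3:(j+1)*3] = img   # write directly into the memmap row

a.flush()   # make sure everything is written to disk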

Han-Kwang Nienhuys
  • Thank you. Just to clarify, after initialising the memmap, can I carry on with concatenation as I intended? Or should I make some changes? – Deepak Jun 11 '20 at 10:00
  • No: allocate the right size from the beginning and don't concatenate. This may require two passes: one to read/count the rows without the image data, one to read the image data. – Han-Kwang Nienhuys Jun 11 '20 at 10:15
  • Thanks. I solved the issue by modifying my logic. Your solution is something I will keep in mind and pass on to my team who face similar issues. – Deepak Jun 11 '20 at 15:51

I solved the issue!

It took a while to revise my logic. The key change was to empty the list after every iteration while figuring out how to maintain the desired dimensions. With a bit of help, I made changes to eliminate the dictionary and the double concatenation. I just used a list, appended to it and concatenated at each iteration, but emptied the 3-image list before the next iteration. Doing this avoided loading everything into memory.

Here is the sample of that code-

import numpy as np

images_concat = []                 # will hold one 70x320x9 array per timestamp
collected_images = []              # the 3 images of the current timestamp

# inside the loop over timestamps:
collected_images.append(image)     # append each of the 3 images

concat_img = np.concatenate(collected_images, axis=2)   # 70x320x9
images_concat.append(concat_img)                        # grows towards Nx70x320x9
collected_images = []                                   # reset for the next timestamp
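
Since I eventually write the result into an HDF5 file, here is a minimal sketch of flushing each batch to a resizable h5py dataset instead of keeping everything in images_concat (the file name, dataset name and float32 dtype are illustrative):

import h5py
import numpy as np

with h5py.File('output.h5', 'w') as f:                  # illustrative file name
    dset = f.create_dataset('images', shape=(0, 70, 320, 9),
                            maxshape=(None, 70, 320, 9),
                            dtype='float32', chunks=True)

    # after building a batch of 70x320x9 arrays in images_concat:
    batch = np.stack(images_concat, axis=0)             # batch_size x 70x320x9
    dset.resize(dset.shape[0] + batch.shape[0], axis=0)
    dset[-batch.shape[0]:] = batch                      # append the batch
    images_concat = []                                  # free memory for the next batch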
Deepak
  • IMO it's better to avoid array concatenation altogether, especially on large arrays inside loops. I have been heavily using numpy on large arrays for years and I use `np.concatenate` so rarely that I can't even remember how the `axis` parameter works. – Han-Kwang Nienhuys Jun 11 '20 at 16:31
  • Got it. I will keep that in mind. So you always do np.memmap for large arrays? – Deepak Jun 11 '20 at 19:24