
I have been running some code, part of which loads a large 1D numpy array from a binary file and then alters the array using the numpy.where() function.

Here is an example of the operations performed in the code:

import numpy as np
num = 2048
threshold = 0.5

with open(file, 'rb') as f:                  # 'file' is the path to the binary file
    arr = np.fromfile(f, dtype=np.float32, count=num**3)   # read num**3 float32 values (~32 GB)
    arr *= threshold                         # scale in place

arr = np.where(arr >= 1.0, 1.0, arr)         # clip values at 1.0
vol_avg = np.sum(arr)/(num**3)               # mean of the clipped array

# both arr and vol_avg needed later

I have run this many times (on a free machine, i.e. no other inhibiting CPU or memory usage) with no issue. But recently I have noticed that the code sometimes hangs for an extended period of time, making the runtime an order of magnitude longer. On these occasions I have been monitoring %CPU and memory usage (using GNOME System Monitor), and found that Python's CPU usage drops to 0%.

Using basic prints between the above operations to debug, it seems arbitrary which operation causes the pause (open(), np.fromfile() and np.where() have each separately caused a hang on a random run). It is as if I am being throttled randomly, because on other runs there are no hangs.
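
For reference, a sketch of the kind of instrumentation I mean, timing each step separately (file is the same path as in the snippet above):

import time
import numpy as np

num = 2048
threshold = 0.5

t0 = time.time()
with open(file, 'rb') as f:     # same placeholder path as above
    arr = np.fromfile(f, dtype=np.float32, count=num**3)
print('read: %.1f s' % (time.time() - t0))

t0 = time.time()
arr *= threshold
arr = np.where(arr >= 1.0, 1.0, arr)
print('scale + clip: %.1f s' % (time.time() - t0))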

I have considered things like garbage collection or this question, but I cannot see any obvious relation to my problem (for example keystrokes have no effect).
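
For what it's worth, to rule out the garbage collector explicitly, something along these lines could be wrapped around the heavy section (just a sketch):

import gc

gc.disable()    # rule out collection pauses during the read/clip section
try:
    pass        # ... the fromfile / np.where block from above goes here ...
finally:
    gc.enable()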

Further notes: the binary file is 32 GB, and the machine (running Linux) has 256 GB of memory. I am running this code remotely, via an SSH session.

EDIT: This may be incidental, but I have noticed that there are no hangs if I run the code just after the machine has been rebooted. They seem to begin after a couple of runs, or at least after other usage of the system.

  • I don't really think this is the case, but is your gnome sys monitor monitoring the thread instead of the process? I believe numpy forks a new thread to do heavy computations, but it should be under the same process group. then again I may be completely wrong – Aaron Jan 30 '17 at 18:34
  • So memory usage isn't the problem, or can you sometimes see paging going on when python hangs? Is your `file` on a local disk on the machine you're using, or is it accessed using NFS or similar? Network I/O could be the culprit, and slowdowns could show up randomly depending on what other users are doing. – wildwilhelm Jan 30 '17 at 19:05
  • @wildwilhelm The file isn't stored on the machine's local disk, so perhaps it is a network I/O problem. I will do further runs and monitor the network upload/download to see if there's a correlation! –  Jan 30 '17 at 19:17

3 Answers

1

np.where is creating a copy there and assigning it back into arr. So, we could save memory there by avoiding that copying step, like so:

vol_avg = (np.sum(arr) - (arr[arr >= 1.0] - 1.0).sum())/(num**3)

We are using boolean indexing to select the elements that are greater than or equal to 1.0, taking their offsets from 1.0, summing those up and subtracting that from the total sum. Hopefully the number of such exceeding elements is small, so this won't incur any more noticeable memory requirement. I am assuming this hanging issue with large arrays is a memory-based one.
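
As a quick sanity check that the two expressions agree, on a small made-up array:

import numpy as np

a = np.random.rand(10) * 2.0                      # small test array with some values >= 1.0
clipped_sum = np.where(a >= 1.0, 1.0, a).sum()    # what the original code computes
alt_sum = a.sum() - (a[a >= 1.0] - 1.0).sum()     # same value without the full-size copy
print(np.isclose(clipped_sum, alt_sum))           # True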

Divakar
  • you think malloc is causing the hang in OS temporarily? – Aaron Jan 30 '17 at 18:36
  • @Aaron Not sure if memory is the only culprit there, but hoping if memory is the issue, this should help lessen the memory requirement of this problem, especially since OP mentioned this is happening with large arrays. – Divakar Jan 30 '17 at 18:37
  • Are you sure that saves anything? I'd think `arr >= 1.0`, then `arr[arr >= 1.0]`, then `arr[arr >= 1.0] - 1.0` would each involve making new arrays, the first of booleans, then a subset of the original array, then a mutated copy of the original array. When the expression finishes, they'd all be cleaned up, but the filter expression would occupy roughly a quarter of the RAM of the original array, and the two intermediates would occupy data proportional to the number of values that passed the filter. The OP's code doubles the required memory (briefly), but has consistent cost. – ShadowRanger Jan 30 '17 at 18:59
  • @ShadowRanger Well `arr >= 1.0` would be a boolean array that on Linux systems occupy 8 times lesser memory. Thereafter, when indexing into `arr` and as stated/assumed in the post that those numbers are lesser than the number of elems in `arr`, the memory requirement should be lesser than making a copy altogether. So, to sum up `arr[arr >= 1.0] - 1.0` would be a smaller array than `arr`, which is summed up to result in a scalar and other terms are : `np.sum(arr)`, which is another scalar. Does that clarify your queries? – Divakar Jan 30 '17 at 19:03
  • Thanks for the response Divakar. One immediate problem is that I still need `arr` later on in the code; that is to say, `vol_avg` is not the final output. Also, if it is a memory issue in the sense that you have described, why does it happen inconsistently? –  Jan 30 '17 at 19:12
  • @llap42 Well if you need that clipped version obtained with `np.where`, then this post won't be of any help. The inconsistency could be because as you mentioned its through SSH, so the resources are shared and someone else might be running some heavy programs too at times? Sorry, no concrete solutions I guess for you! – Divakar Jan 30 '17 at 19:15
  • @Divakar Yes the resource is shared, although as I said in the post, at the time of running the machine was not being used. It could be a network I/O problem with the disk though, as wildwilhelm suggested above. Thanks for the help anyway! –  Jan 30 '17 at 19:23
0

The drops in CPU usage were unrelated to Python or numpy; they were in fact a result of reading from a shared disk, and network I/O was the real culprit. For such large arrays, reading into memory can be a major bottleneck.
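
For anyone hitting the same thing, a rough way to confirm it is to time the raw read and work out the effective throughput (a sketch; the path is a placeholder):

import time
import numpy as np

num = 2048
t0 = time.time()
with open('/path/to/shared/data.bin', 'rb') as f:   # placeholder path on the shared filesystem
    arr = np.fromfile(f, dtype=np.float32, count=num**3)
elapsed = time.time() - t0
print('%.1f s elapsed, %.1f MB/s' % (elapsed, arr.nbytes / elapsed / 1e6))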

0

Did you click or select the console window? This can "hang" the process, because the console enters QuickEdit mode. Pressing any key can resume the process.