
I have two different numpy arrays and I would like to shuffle them in a synchronized way.

The current solution is taken from https://www.tensorflow.org/versions/r0.8/tutorials/mnist/pros/index.html and proceeds as follows:

perm = np.arange(self.no_images_train)
np.random.shuffle(perm)
self.images_train = self.images_train[perm]
self.labels_train = self.labels_train[perm]

The problem is that memory usage doubles each time I do it. Somehow the old arrays are not getting deleted, probably because the slicing operator creates views, I guess. I tried the following change, out of pure desperation:

perm = np.arange(self.no_images_train)
np.random.shuffle(perm)

n_images_train = self.images_train[perm]
n_labels_train = self.labels_train[perm]            

del self.images_train
del self.labels_train
gc.collect()

self.images_train = n_images_train
self.labels_train = n_labels_train

Still the same: memory leaks, and I am running out of memory after a couple of operations.

Btw, the two arrays have shapes (100000, 224, 244, 1) and (100000, 1).

I know that this has been dealt with here (Better way to shuffle two numpy arrays in unison), but the answer didn't help me, as the provided solution needs slicing again.

Thanks for any help.

Flonks
    Those aren't views. You may have other references to the original arrays somewhere. – user2357112 Jun 14 '16 at 19:26
  • *"...because the slicing operator creates views I guess."* Slicing *does* create views, but the code that you show is not slicing. When you write `a[perm]`, a copy is made. "Slicing" refers to the operation using a colon: `start:end:step`, e.g. `0:4`, `4:`, etc. – Warren Weckesser Jun 14 '16 at 19:27
  • *"... in asynchronized way."* I think you are missing a space. Based on what follows, I think you meant "in a synchronized way." – Warren Weckesser Jun 14 '16 at 19:28
  • *"...rank 100000,224,244,1..."* That's almost 5.5 gigabytes (assuming the data type is 8 bit). Even in your "desperation" code, there is a time when `self.images_train` and `n_images_train" will both exist, which will require 11 gigabytes. This is not a memory "leak". – Warren Weckesser Jun 14 '16 at 19:40
  • I think a better title for this question is "How do I apply the same random permutation to two arrays without making temporary copies of the arrays?" – Warren Weckesser Jun 14 '16 at 20:16
  • Thanks for the answers. Yes, the data is huge, that's why there is a problem. The problem is also not that the copies are _temporary_. They are _not_. The data stays around. This is a real leak. – Flonks Jun 14 '16 at 20:23
  • In that case, there could be an underlying bug in numpy, and you might consider filing a bug report at https://github.com/numpy/numpy/issues If you do, a MCVE (http://stackoverflow.com/help/mcve) would be helpful--in fact, it would be helpful here, too. In the meantime, see if my answer works for you. – Warren Weckesser Jun 14 '16 at 20:55
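
A quick way to see the view-versus-copy difference discussed above is `np.shares_memory`; this is only a minimal sketch, and it assumes a NumPy version that provides that function:

import numpy as np

a = np.arange(10)
perm = np.random.permutation(10)

print(np.shares_memory(a, a[2:8]))   # True: a slice is a view into a
print(np.shares_memory(a, a[perm]))  # False: indexing with an array makes a copy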

2 Answers


One way to permute two large arrays in-place in a synchronized way is to save the state of the random number generator and then shuffle the first array. Then restore the state and shuffle the second array.

For example, here are my two arrays:

In [48]: a
Out[48]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [49]: b
Out[49]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

Save the current internal state of the random number generator:

In [50]: state = np.random.get_state()

Shuffle a in-place:

In [51]: np.random.shuffle(a)

Restore the internal state of the random number generator:

In [52]: np.random.set_state(state)

Shuffle b in-place:

In [53]: np.random.shuffle(b)

Check that the permutations are the same:

In [54]: a
Out[54]: array([13, 12, 11, 15, 10,  5,  1,  6, 14,  3,  9,  7,  0,  8,  4,  2])

In [55]: b
Out[55]: array([13, 12, 11, 15, 10,  5,  1,  6, 14,  3,  9,  7,  0,  8,  4,  2])

For your code, this would look like:

state = np.random.get_state()
np.random.shuffle(self.images_train)
np.random.set_state(state)
np.random.shuffle(self.labels_train)
Warren Weckesser
  • It does help, thank you. However, I actually found something better: circumventing the problem. I decided not to shuffle the data periodically, but only to recreate a permutation vector and to sample using it (roughly as sketched after these comments). I'd still like to know why the original solution fails. – Flonks Jun 15 '16 at 13:47
  • However, this solution needs two calls to the random number generator, which may become a performance bottleneck. You may use a different random number generator to reduce this effect. – Guillaum Jun 15 '16 at 16:19
  • @Guillaum Yes, the two calls to the random number generator (to generate the same sequence!) might be an issue, so some performance testing is recommended. How would using a different generator help? – Warren Weckesser Jun 15 '16 at 16:49
  • @WarrenWeckesser As far as I know, the random generator of numpy is a Mersenne Twister. There are random number generators with different behavior (in quality and speed). For example, using C++ `std::mt19937_64` (Mersenne Twister) or `std::minstd_rand` (a simpler approach) to generate 10 million random numbers takes 8.3 s versus 1.0 s on my computer. However, I thought numpy came with different generators, but I was wrong. – Guillaum Jun 15 '16 at 17:52
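
For reference, a rough sketch of the permutation-vector workaround mentioned in the comment above: shuffle only an index array and draw mini-batches through it, so fancy indexing copies one batch at a time rather than the whole dataset. The helper name `batch_iter` and its arguments are illustrative, not taken from the original code.

import numpy as np

def batch_iter(images, labels, batch_size):
    # Shuffle indices only; the large arrays stay untouched and in place.
    perm = np.random.permutation(len(images))
    for start in range(0, len(images), batch_size):
        idx = perm[start:start + batch_size]
        # Fancy indexing still copies, but only one batch at a time,
        # so the extra memory is batch-sized rather than dataset-sized.
        yield images[idx], labels[idx]

This keeps an epoch-level shuffle without ever materializing a fully permuted copy of the training set.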

Actually, I don't think there is any issue with numpy or python. Numpy uses the system malloc/free to allocate arrays, and this can lead to memory fragmentation (see Memory Fragmentation on SO).

So I guess your memory profile may keep increasing and then suddenly drop when the system manages to reduce fragmentation, if it can.

Guillaum
  • Memory increases in steps of 6 GB, and at 230 GB I kill the process on my machine with 64 GB of physical memory. I am not sure that this can be totally attributed to fragmentation, especially since there is no real reason why there should be more than 6 GB of memory in use over longer periods of time (apart from temporary allocations for copying etc.). – Flonks Jun 16 '16 at 18:13