
In my program I fill a large numpy array with elements, the number of which I do not know in advance. Since adding a single element at a time to a numpy array is inefficient, I grow it in chunks of length 10000 initialized with zeros. This leaves me, in the end, with an array that has a tail of zeros, whereas what I want is an array whose length is exactly the number of meaningful elements (later on I cannot distinguish junk zeros from actual data points with zero value). Straightforwardly copying a slice, however, doubles the RAM consumption, which is really undesirable since my arrays are quite large. I looked into the numpy.split function, but it seemed to split arrays only into chunks of equal size, which of course does not suit me.

I illustrate the problem with the following code:

import numpy, os, random

def check_memory(mode_peak = True, mark = ''):
    """Function for measuring the memory consumption (Linux only)"""
    pid = os.getpid()
    with open('/proc/{}/status'.format(pid), 'r') as ifile:
        for line in ifile:
            if line.startswith('VmPeak' if mode_peak else 'VmSize'):
                memory = line[: -1].split(':')[1].strip().split()[0]
                memory = int(memory) / (1024 * 1024)
                break
    mode_str = 'Peak' if mode_peak else 'Current'
    print('{}{} RAM consumption: {:.3f} GB'.format(mark, mode_str, memory))

def generate_element():
    """Test element generator"""
    for i in range(12345678):
        yield numpy.array(random.randrange(0, 1000), dtype = 'i4')

check_memory(mode_peak = False, mark = '#1 ')
a = numpy.zeros(10000, dtype = 'i4')
i = 0
for element in generate_element():
    if i == len(a):
        a = numpy.concatenate((a, numpy.zeros(10000, dtype = 'i4')))
    a[i] = element
    i += 1
check_memory(mode_peak = False, mark = '#2 ')
a = a[: i]
check_memory(mode_peak = False, mark = '#3 ')
check_memory(mode_peak = True, mark = '#4 ')

This outputs:

#1 Current RAM consumption: 0.070 GB
#2 Current RAM consumption: 0.118 GB
#3 Current RAM consumption: 0.118 GB
#4 Peak RAM consumption: 0.164 GB

Can anyone help me find a solution that does not significantly penalize runtime or RAM consumption?

Edit:

I tried to use

a = numpy.delete(a, numpy.s_[i: ])

as well as

a = numpy.split(a, (i, ))[0]

However, both result in the same doubled memory consumption.

Roman
  • Probably speed is unimportant to you relative to memory, but I don't know how to test memory consumption on my system (mac os x). In any case, it's about 2x faster for me to build a list and then convert to array at the end. Fastest for me (though I don't know how it actually is implemented) is `np.fromiter`, but I assume your generator is just for testing and not what you're actually using. Also, if you yield scalars instead of arrays (as your `element`) that will be much faster of course, unless each `element` will actually have some length in your use case. – askewchan Aug 30 '15 at 21:37
  • @askewchan In my case the `array` generation is a step in a big program that contributes only a tiny fraction of the total runtime, so the speed is not critical. On the other hand, this step was the memory bottleneck. And the generator is of course much more complex and involves receiving data from the network. – Roman Sep 02 '15 at 07:38
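
For reference, here is a minimal sketch of the `np.fromiter` approach mentioned in the comment above; the generator is illustrative, and the actual memory behaviour would still need to be verified with `check_memory`:

import numpy, random

def generate_scalar():
    """Illustrative generator yielding plain ints rather than 0-d arrays"""
    for i in range(12345678):
        yield random.randrange(0, 1000)

# With the default count=-1, numpy.fromiter manages the buffer growth
# internally and returns a 1-D array of exactly the produced length
a = numpy.fromiter(generate_scalar(), dtype = 'i4')

# If the number of elements happens to be known in advance, count lets
# numpy allocate the exact size up front:
# a = numpy.fromiter(generate_scalar(), dtype = 'i4', count = 12345678)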

2 Answers


numpy.split does not have to split the array into equal-sized chunks: the indices_or_sections parameter also accepts a list of indices, and the array is then split at those positions. For example:

>>> x = np.arange(8.0)
>>> np.split(x, [3, 5, 6, 10])
[array([ 0.,  1.,  2.]),   # x[:3]
 array([ 3.,  4.]),        # x[3:5]
 array([ 5.]),             # x[5:6]
 array([ 6.,  7.]),        # x[6:10]
 array([], dtype=float64)] # x[10:]
tmdavison
  • My mistake: I did not understand the syntax of `numpy.split` correctly. However, this results in the same doubled RAM consumption as slicing: `a = numpy.split(a, (i,))[0]` – Roman Aug 28 '15 at 12:45

Finally I figured it out. In fact, the extra memory was consumed not only during the trimming stage, but also during the concatenation. Therefore, introducing a peak memory check at point #2 outputs:

#2 Peak RAM consumption: 0.164 GB

However, there is the resize() method, which changes the size/shape of an array in-place:

check_memory(mode_peak = False, mark = '#1 ')
page_size = 10000
a = numpy.zeros(page_size, dtype = 'i4')
i = 0
for element in generate_element():
    if (i != 0) and (i % page_size == 0):
        a.resize(i + page_size)
    a[i] = element
    i += 1
a.resize(i)
check_memory(mode_peak = False, mark = '#2 ')
check_memory(mode_peak = True, mark = '#2 ')

This leads to the following output:

#1 Current RAM consumption: 0.070 GB
#2 Current RAM consumption: 0.118 GB
#2 Peak RAM consumption: 0.118 GB

In addition, since the array is no longer rebuilt and copied at every growth step, the performance improved significantly as well.
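
For completeness, here is a minimal sketch that wraps this page-based growth into a small reusable helper; the class name, the `finalize()` method and the use of `refcheck = False` are illustrative additions, not part of the original answer:

import numpy

class GrowingArray:
    """Illustrative append-only buffer that grows by pages via ndarray.resize()"""

    def __init__(self, dtype = 'i4', page_size = 10000):
        self._data = numpy.zeros(page_size, dtype = dtype)
        self._page_size = page_size
        self._count = 0

    def append(self, value):
        if self._count == len(self._data):
            # Grow in place by one page; refcheck = False skips the check for
            # other references to the buffer, which is acceptable here because
            # the array is only handed out through finalize()
            self._data.resize(self._count + self._page_size, refcheck = False)
        self._data[self._count] = value
        self._count += 1

    def finalize(self):
        # Trim the unused tail in place and return the array
        self._data.resize(self._count, refcheck = False)
        return self._data

The usage then mirrors the loop from the answer above:

buf = GrowingArray(dtype = 'i4')
for element in generate_element():
    buf.append(element)
a = buf.finalize()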

Roman