
In my program I fill a large numpy array with elements, the number of which I do not know in advance. Since adding a single element at a time to a numpy array is inefficient, I grow it in chunks of length 10000 initialized with zeros. This leaves me, in the end, with an array that has a tail of zeros, whereas what I want is an array whose length is exactly the number of meaningful elements (later on I cannot distinguish junk zeros from actual data points with zero value). Straightforwardly copying a slice, however, doubles the RAM consumption, which is really undesirable since my arrays are quite large. I looked into the numpy.split function, but it seemed to split arrays only into chunks of equal size, which of course does not suit me.

I illustrate the problem with the following code:

import numpy, os, random

def check_memory(mode_peak = True, mark = ''):
    """Function for measuring the memory consumption (Linux only)"""
    pid = os.getpid()
    with open('/proc/{}/status'.format(pid), 'r') as ifile:
        for line in ifile:
            if line.startswith('VmPeak' if mode_peak else 'VmSize'):
                memory = line[: -1].split(':')[1].strip().split()[0]
                memory = int(memory) / (1024 * 1024)
                break
    mode_str = 'Peak' if mode_peak else 'Current'
    print('{}{} RAM consumption: {:.3f} GB'.format(mark, mode_str, memory))

def generate_element():
    """Test element generator"""
    for i in range(12345678):
        yield numpy.array(random.randrange(0, 1000), dtype = 'i4')

check_memory(mode_peak = False, mark = '#1 ')
a = numpy.zeros(10000, dtype = 'i4')
i = 0
for element in generate_element():
    if i == len(a):
        a = numpy.concatenate((a, numpy.zeros(10000, dtype = 'i4')))
    a[i] = element
    i += 1
check_memory(mode_peak = False, mark = '#2 ')
a = a[: i]
check_memory(mode_peak = False, mark = '#3 ')
check_memory(mode_peak = True, mark = '#4 ')

This outputs:

#1 Current RAM consumption: 0.070 GB
#2 Current RAM consumption: 0.118 GB
#3 Current RAM consumption: 0.118 GB
#4 Peak RAM consumption: 0.164 GB

Can anyone help me find a solution that does not significantly penalize runtime or RAM consumption?

Edit:

I tried to use

a = numpy.delete(a, numpy.s_[i: ])

as well as

a = numpy.split(a, (i, ))[0]

However, both result in the same doubled memory consumption.

Roman
  • Probably speed is unimportant to you relative to memory, but I don't know how to test memory consumption on my system (mac os x). In any case, it's about 2x faster for me to build a list and then convert to array at the end. Fastest for me (though I don't know how it actually is implemented) is `np.fromiter`, but I assume your generator is just for testing and not what you're actually using. Also, if you yield scalars instead of arrays (as your `element`) that will be much faster of course, unless each `element` will actually have some length in your use case. – askewchan Aug 30 '15 at 21:37
  • @askewchan In my case the `array` generation is a step in a big program that contributes only a tiny fraction of the total runtime, so the speed is not critical. On the other hand, this step was the memory bottleneck. And the generator is of course much more complex and involves receiving data from the network. – Roman Sep 02 '15 at 07:38
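
For reference, here is a minimal sketch of the `np.fromiter` approach mentioned in the comment above; the generator is illustrative, and the actual memory behaviour would still need to be verified with `check_memory`:

import numpy, random

def generate_scalar():
    """Illustrative generator yielding plain ints rather than 0-d arrays"""
    for i in range(12345678):
        yield random.randrange(0, 1000)

# With the default count=-1, numpy.fromiter manages the buffer growth
# internally and returns a 1-D array of exactly the produced length
a = numpy.fromiter(generate_scalar(), dtype = 'i4')

# If the number of elements happens to be known in advance, count lets
# numpy allocate the exact size up front:
# a = numpy.fromiter(generate_scalar(), dtype = 'i4', count = 12345678)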

2 Answers


numpy.split does not have to split the array into equal-sized chunks: the indices_or_sections parameter also accepts a list of indices, and the array is then split at those positions. For example:

>>> x = np.arange(8.0)
>>> np.split(x, [3, 5, 6, 10])
[array([ 0.,  1.,  2.]),   # x[:3]
 array([ 3.,  4.]),        # x[3:5]
 array([ 5.]),             # x[5:6]
 array([ 6.,  7.]),        # x[6:10]
 array([], dtype=float64)] # x[10:]
tmdavison
  • My mistake: I did not understand the syntax of `numpy.split` correctly. However, this results in the same doubled RAM consumption as slicing: `a = numpy.split(a, (i,))[0]` – Roman Aug 28 '15 at 12:45

Finally I figured it out. In fact, the extra memory was consumed not only during the trimming stage, but also during the concatenation. Therefore, introducing a peak memory check at point #2 outputs:

#2 Peak RAM consumption: 0.164 GB

However, there is the resize() method, which changes the size/shape of an array in-place:

check_memory(mode_peak = False, mark = '#1 ')
page_size = 10000
a = numpy.zeros(page_size, dtype = 'i4')
i = 0
for element in generate_element():
    if (i != 0) and (i % page_size == 0):
        a.resize(i + page_size)
    a[i] = element
    i += 1
a.resize(i)
check_memory(mode_peak = False, mark = '#2 ')
check_memory(mode_peak = True, mark = '#2 ')

This leads to the following output:

#1 Current RAM consumption: 0.070 GB
#2 Current RAM consumption: 0.118 GB
#2 Peak RAM consumption: 0.118 GB

In addition, since the array is no longer rebuilt and copied at every growth step, the performance improved significantly as well.
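
For completeness, here is a minimal sketch that wraps this page-based growth into a small reusable helper; the class name, the `finalize()` method and the use of `refcheck = False` are illustrative additions, not part of the original answer:

import numpy

class GrowingArray:
    """Illustrative append-only buffer that grows by pages via ndarray.resize()"""

    def __init__(self, dtype = 'i4', page_size = 10000):
        self._data = numpy.zeros(page_size, dtype = dtype)
        self._page_size = page_size
        self._count = 0

    def append(self, value):
        if self._count == len(self._data):
            # Grow in place by one page; refcheck = False skips the check for
            # other references to the buffer, which is acceptable here because
            # the array is only handed out through finalize()
            self._data.resize(self._count + self._page_size, refcheck = False)
        self._data[self._count] = value
        self._count += 1

    def finalize(self):
        # Trim the unused tail in place and return the array
        self._data.resize(self._count, refcheck = False)
        return self._data

The usage then mirrors the loop from the answer above:

buf = GrowingArray(dtype = 'i4')
for element in generate_element():
    buf.append(element)
a = buf.finalize()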

Roman