
Like many others, my situation is that I have a class which collects a large amount of data, and provides a method to return the data as a numpy array. (Additional data can continue to flow in, even after returning an array). Since creating the array is an expensive operation, I want to create it only when necessary, and to do it as efficiently as possible (specifically, to append data in-place when possible).

For that, I've been doing some reading about the ndarray.resize() method, and the refcheck argument. I understand that refcheck should be set to False only when "you are sure that you have not shared the memory for this array with another Python object".

The thing is, I'm not sure. Sometimes I have, sometimes I haven't. I'm fine with it raising an error if the refcheck fails (I can catch it and then create a new copy), but I want it to fail only when there are "real" external references, ignoring the ones I know to be safe.
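To make the failure mode concrete, here's a minimal sketch (a toy example, not my real code):

```python
import numpy as np

a = np.arange(3)
a.resize(5, refcheck=True)   # works: only 'a' references the array
b = a                        # now a second reference exists
try:
    a.resize(8, refcheck=True)
except ValueError:
    print("resize refused: the array is referenced by another object")
```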

Here's a simplified illustration:

import numpy as np

def array_append(arr, values, refcheck=True):
    added_len = len(values)
    if added_len == 0:
        return arr
    old_len = len(arr)
    new_len = old_len + added_len
    arr.resize(new_len, refcheck=refcheck)
    arr[old_len:] = values
    return arr

class DataCollector(object):

    def __init__(self):
        self._new_data = []
        self._arr = np.array([])

    def add_data(self, data):
        self._new_data.append(data)

    def get_data_as_array(self):
        self._flush()
        return self._arr

    def _flush(self):
        if not self._new_data:
            return
#        self._arr = self._append1()
#        self._arr = self._append2()
        self._arr = self._append3()
        self._new_data = []

    def _append1(self):
        # always raises an error, because there are at least 2 refs:
        # self._arr and local variable 'arr' in array_append()
        return array_append(self._arr, self._new_data, refcheck=True)

    def _append2(self):
        # Does not raise an error, but unsafe in case there are other
        # references to self._arr
        return array_append(self._arr, self._new_data, refcheck=False)

    def _append3(self):
        # "inline" version: works if there are no other references
        # to self._arr, but raises an error if there are.
        added_len = len(self._new_data)
        old_len = len(self._arr)
        self._arr.resize(old_len + added_len, refcheck=True)
        self._arr[old_len:] = self._new_data
        return self._arr

dc = DataCollector()
dc.add_data(0)
dc.add_data(1)
print(dc.get_data_as_array())
dc.add_data(2)
print(dc.get_data_as_array())
x = dc.get_data_as_array()  # create an external reference
print(x.shape)
for i in range(5000):
    dc.add_data(999)
print(dc.get_data_as_array())
print(x.shape)
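For completeness, the catch-and-copy fallback I mentioned would look roughly like this (a sketch; the resize has to stay inlined, since passing the array to a helper function adds a local reference that trips refcheck):

```python
import numpy as np

class DataCollector(object):

    def __init__(self):
        self._new_data = []
        self._arr = np.array([])

    def add_data(self, data):
        self._new_data.append(data)

    def get_data_as_array(self):
        self._flush()
        return self._arr

    def _flush(self):
        if not self._new_data:
            return
        try:
            old_len = len(self._arr)
            self._arr.resize(old_len + len(self._new_data), refcheck=True)
            self._arr[old_len:] = self._new_data
        except ValueError:
            # an external reference exists: fall back to a fresh copy
            self._arr = np.append(self._arr, self._new_data)
        self._new_data = []

dc = DataCollector()
dc.add_data(1.0)
dc.add_data(2.0)
x = dc.get_data_as_array()     # in-place resize path
dc.add_data(3.0)
print(dc.get_data_as_array())  # copy path, since x still references the array
```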

Questions:

  1. Is there a better (fast) way of doing what I'm trying to do (creating numpy array incrementally)?
  2. Is there a way of telling the resize() method: "perform refcheck, but ignore that one reference (or n references) which I know to be safe"? (that would solve the problem that _append1() always fails)
shx2

2 Answers


I would use array.array() to do the data collection:

import array
a = array.array("d")
for i in range(100):
    a.append(i*2)

Whenever you want to do a calculation with the collected data, convert it to a numpy.ndarray by numpy.frombuffer:

import numpy as np

b = np.frombuffer(a, dtype=float)
print(np.mean(b))

b will share data memory with a, so the conversion is very fast.
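A sketch of the recreate-after-append pattern this implies (note: on modern Python 3, `array.array` refuses to resize while an ndarray is exporting its buffer, so the old view must be dropped first; this was not enforced under Python 2):

```python
import array
import numpy as np

a = array.array("d", [1.0, 2.0, 3.0])
b = np.frombuffer(a, dtype=float)
assert b.mean() == 2.0

del b                              # release the exported buffer first
a.append(4.0)                      # may reallocate the underlying memory
b = np.frombuffer(a, dtype=float)  # recreate the view after appending
print(b.mean())                    # 2.5
```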

HYRY
  • That is cool and solves the problem for numeric types. However, in my real-life problem, I collect records (and create a recarray from them), with numbers, strings, and objects. – shx2 Mar 05 '13 at 14:13
  • Also, @HYRY can you please explain how this data-sharing works in your example? What happens when the shared data needs to be reallocated when a.append() is called again? – shx2 Mar 05 '13 at 17:48
  • `a.append()` may reallocate the memory, so you can't use an `ndarray` created earlier; every time you need to do a calculation with the data, create the `ndarray` with `np.frombuffer(a)`. – HYRY Mar 05 '13 at 22:12
  • And if I still hold a reference to the old result of `np.frombuffer(a)`? Will it still be usable? I mean, does it get corrupted, or just keeps the old memory from getting garbage-collected (due to ref-counting)? – shx2 Mar 06 '13 at 06:21
  • The old memory is not usable; it is freed by `a.append()`. – HYRY Mar 06 '13 at 08:34
  • In that case this isn't a valid solution to my problem. I can't give the caller a reference which later becomes unusable... – shx2 Mar 06 '13 at 09:08

The resize method has two main problems. The first is that you return a reference to self._arr when the user calls get_data_as_array. Depending on your implementation, the subsequent resize will then do one of two things to that array: either it modifies the array you've given your user (i.e. the user takes a.shape and the shape changes unpredictably), or it corrupts that array, leaving it pointing at bad memory. You could solve that issue by always having get_data_as_array return self._arr.copy(), but that brings me to the second issue: resize is actually not very efficient. I believe that, in general, resize has to allocate new memory and do a copy every time it is called to grow an array. Plus now you need to copy the array every time you want to return it to your user.
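The shape-change hazard is easy to reproduce (an illustrative sketch of the first failure mode):

```python
import numpy as np

a = np.arange(3)
x = a                         # reference the "user" holds
a.resize(6, refcheck=False)   # bypasses the safety check
print(x.shape)                # (6,): the user's array changed under them
```

When the array is enlarged, the new trailing entries are zero-filled, so x silently grows and gains zeros the user never asked for.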

Another approach would be to design your own dynamic array, which would look something like this:

class DynamicArray(object):

    def __init__(self):
        # instance attributes (class-level attributes would be shared
        # between instances)
        self._data = np.empty(1)     # backing storage (capacity)
        self.data = self._data[:0]   # view exposing only the used part
        self.scale_factor = 2

    def append(self, values):
        old_len = len(self.data)
        total_len = len(values) + old_len
        capacity = len(self._data)
        if capacity < total_len:
            # grow geometrically so appends are amortized O(1);
            # int() is needed because np.ceil returns a float
            while capacity < total_len:
                capacity = int(np.ceil(capacity * self.scale_factor))
            new_data = np.empty(capacity)
            new_data[:old_len] = self.data
            self._data = new_data
        self._data[old_len:total_len] = values
        self.data = self._data[:total_len]

This should be very fast because you only need to grow the array log(N) times, and it uses at most 2*N-1 storage, where N is the maximum size of the array. Other than growing the array, you're just making views of _data, which doesn't involve any copying and should be constant time.
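A quick way to check the log(N) claim (illustrative numbers, independent of the class above):

```python
import numpy as np

# Count how many reallocations geometric growth needs for a million elements.
capacity, reallocs = 1, 0
target = 1_000_000
while capacity < target:
    capacity = int(np.ceil(capacity * 2))
    reallocs += 1
print(reallocs)   # 20, i.e. about log2(1e6) reallocations
```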

Hope this is useful.

Bi Rico
  • The downside of this solution is that it uses on average 50% more memory, which in my case means a few GBs. But of course, there's a tradeoff between size and speed, and in my case speed is probably more important. I'll give it a try soon. – shx2 Mar 07 '13 at 07:08