Like many others, I have a class that collects a large amount of data and provides a method to return that data as a numpy array. (Additional data can keep flowing in, even after an array has been returned.) Since creating the array is an expensive operation, I want to create it only when necessary, and to do it as efficiently as possible (specifically, to append data in place when possible).
For that, I've been doing some reading about the ndarray.resize() method, and the refcheck argument. I understand that refcheck should be set to False only when "you are sure that you have not shared the memory for this array with another Python object".
The thing is, I'm not sure. Sometimes I have, sometimes I haven't. I'm fine with it raising an error if refcheck fails (I can catch it and then create a new copy), but I want it to fail only when there are "real" external references, ignoring the ones I know to be safe.
Here's a simplified illustration:
import numpy as np


def array_append(arr, values, refcheck=True):
    added_len = len(values)
    if added_len == 0:
        return arr
    old_len = len(arr)
    new_len = old_len + added_len
    arr.resize(new_len, refcheck=refcheck)
    arr[old_len:] = values
    return arr


class DataCollector(object):

    def __init__(self):
        self._new_data = []
        self._arr = np.array([])

    def add_data(self, data):
        self._new_data.append(data)

    def get_data_as_array(self):
        self._flush()
        return self._arr

    def _flush(self):
        if not self._new_data:
            return
        # self._arr = self._append1()
        # self._arr = self._append2()
        self._arr = self._append3()
        self._new_data = []

    def _append1(self):
        # always raises an error, because there are at least 2 refs:
        # self._arr and local variable 'arr' in array_append()
        return array_append(self._arr, self._new_data, refcheck=True)

    def _append2(self):
        # Does not raise an error, but unsafe in case there are other
        # references to self._arr
        return array_append(self._arr, self._new_data, refcheck=False)

    def _append3(self):
        # "inline" version: works if there are no other references
        # to self._arr, but raises an error if there are.
        added_len = len(self._new_data)
        old_len = len(self._arr)
        self._arr.resize(old_len + added_len, refcheck=True)
        self._arr[old_len:] = self._new_data
        return self._arr

dc = DataCollector()
dc.add_data(0)
dc.add_data(1)
print(dc.get_data_as_array())
dc.add_data(2)
print(dc.get_data_as_array())
x = dc.get_data_as_array()  # create an external reference
print(x.shape)
for i in range(5000):
    dc.add_data(999)
print(dc.get_data_as_array())
print(x.shape)
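
For concreteness, the catch-and-copy fallback mentioned above (catch the error from resize(), then create a new copy) might look like the minimal sketch below. The class name FallbackCollector and its extend() method are made up for illustration; the only NumPy calls used are ndarray.resize() and np.concatenate(). The resize is done inline on the attribute, as in _append3(), since passing the array to a helper function adds a reference of its own.

```python
import numpy as np


class FallbackCollector(object):
    """Hypothetical collector: grow in place if allowed, else copy."""

    def __init__(self):
        self._arr = np.array([])

    def extend(self, values):
        values = np.asarray(values, dtype=self._arr.dtype)
        old_len = len(self._arr)
        try:
            # Raises ValueError when NumPy detects extra references
            # (or when the array does not own its data).
            self._arr.resize(old_len + len(values), refcheck=True)
        except ValueError:
            # Fallback: build a fresh array; existing references keep
            # pointing at the old, unchanged one.
            self._arr = np.concatenate([self._arr, values])
        else:
            self._arr[old_len:] = values
        return self._arr
```

This still cannot tell "safe" references apart from "real" ones; any extra reference simply pushes it onto the slower copying path.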
Questions:
- Is there a better (fast) way of doing what I'm trying to do (creating numpy array incrementally)?
- Is there a way of telling the resize() method: "perform refcheck, but ignore that one reference (or n references) which I know to be safe"? (that would solve the problem that _append1() always fails)
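
For comparison with the first question, one common alternative that avoids resize() and refcheck entirely is to over-allocate a buffer and double its capacity when it fills up, so that appends are amortized O(1). The sketch below is only an illustration under that assumption: GrowableBuffer, extend() and as_array() are hypothetical names, and as_array() returns a copy, which costs one extra allocation per call but keeps the caller's array independent of later appends.

```python
import numpy as np


class GrowableBuffer(object):
    """Hypothetical append-only buffer with amortized O(1) appends."""

    def __init__(self, dtype=float, capacity=16):
        self._buf = np.empty(capacity, dtype=dtype)
        self._size = 0  # number of valid elements in self._buf

    def extend(self, values):
        values = np.asarray(values, dtype=self._buf.dtype).ravel()
        needed = self._size + len(values)
        if needed > len(self._buf):
            # Grow to at least double the old capacity, then copy
            # the valid prefix into the new buffer.
            new_cap = max(needed, 2 * len(self._buf))
            new_buf = np.empty(new_cap, dtype=self._buf.dtype)
            new_buf[:self._size] = self._buf[:self._size]
            self._buf = new_buf
        self._buf[self._size:needed] = values
        self._size = needed

    def as_array(self):
        # Copy, so later appends cannot change what the caller holds.
        return self._buf[:self._size].copy()
```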