10

What is the most efficient way of incrementally building a numpy array, e.g. one row at a time, without knowing the final size in advance?

My use case is as follows. I need to load a large file (10-100M lines) for which each line requires string processing and should form a row of a numpy array.

Is it better to load the data to a temporary Python list and convert to an array or is there some existing mechanism in numpy that would make it more efficient?

Andrzej Pronobis
  • 33,828
  • 17
  • 76
  • 92
  • Adding rows to an existing ndarray is accomplished with `numpy.concatenate` or `numpy.vstack`. What will a single row look like? – derricw May 26 '15 at 20:56
  • 2
    `loadtxt` and `genfromtxt` both collect a list of lists (line by line), and convert it to an array at the end. Their code is Python so you can browse it. – hpaulj May 26 '15 at 21:01
  • 1
    If the data format is known exactly and each line is data, you can count the number of lines and preallocate your array. This probably won't be faster, but you can avoid creating two objects that hold the same large amount of data. See http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python (a sketch of this two-pass approach follows these comments) – Alan May 26 '15 at 21:27
  • Memory should not be an issue here yet, although it might be a viable solution for some cases. – Andrzej Pronobis May 26 '15 at 22:19
  • @derricw No! `vstack` and the like reallocate and copy all of the memory on every call; this would take forever. If possible, don't use them! – Gulzar Sep 29 '19 at 10:54
  • @Gulzar heh, yeah see my answer below – derricw Sep 30 '19 at 16:46
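
For illustration, here is a minimal sketch of the two-pass count-then-preallocate approach mentioned in the comments above. It is not code from the thread: the file name `data.txt`, the column count `N_COLS`, and `parse_line` are stand-ins for the OP's actual data and per-line string processing.

import numpy as np

N_COLS = 100  # assumed, known number of columns per row

def parse_line(line):
    # placeholder for the real per-line string processing
    return [float(x) for x in line.split()]

# first pass: count the lines cheaply
with open("data.txt", "rb") as f:
    n_lines = sum(1 for _ in f)

# second pass: fill a preallocated array row by row
arr = np.empty((n_lines, N_COLS))
with open("data.txt") as f:
    for i, line in enumerate(f):
        arr[i] = parse_line(line)

The file is read twice, but no full-size Python list is ever built, so only the final array is held in memory.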

2 Answers

10

You should get better performance out of appending each row to a list and then converting to an ndarray afterward.

Here's a test where I append ndarrays to a list 10000 times and then generate a new ndarray when I'm done:

import numpy as np

row = np.random.randint(0, 100, size=(1, 100))

And I time it with ipython notebook:

%%timeit
l = [row]

for i in range(10000):
    l.append(row)

n = np.array(l)

-> 10 loops, best of 3: 132 ms per loop

And here's a test where I concatenate each row:

%%timeit
l = row

for i in range(10000):
    l = np.concatenate((l, row),axis=0)

-> 1 loops, best of 3: 23.1 s per loop

Way slower.

The only issue with the first method is that you will wind up with both the list and the array in memory at the same time, so you could potentially run into RAM issues. You could avoid that by converting in chunks, as sketched below.
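
A minimal sketch of that chunked variant (not part of the original answer; `rows` is assumed to be any iterable of equal-length rows, such as parsed lines, and the chunk size is arbitrary):

import numpy as np

def build_array_in_chunks(rows, chunk_size=100000):
    # Collect rows in a small Python list, convert each full chunk to an
    # ndarray, and stack all chunks at the end, so only one chunk's worth
    # of list overhead is held in memory at any time.
    chunks = []
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) == chunk_size:
            chunks.append(np.array(buf))
            buf = []
    if buf:
        chunks.append(np.array(buf))
    return np.concatenate(chunks, axis=0)

It could be called, for example, as `build_array_in_chunks(parse(line) for line in f)` with whatever per-line parsing the file needs.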

derricw
  • 6,757
  • 3
  • 30
  • 34
3

On my laptop (single-core Intel i5, 1.7 GHz):

%%timeit
l = [row]

for i in range(10000):
    l.append(row)

n = np.array(l)
100 loops, best of 3: 5.54 ms per loop

My best attempt with pure numpy (maybe someone knows a better solution):

%%timeit
l = np.empty((int(1e5), row.shape[1]))  # preallocate more rows than needed
for i in range(10000):
    l[i] = row
# drop the unused rows (relies on the uninitialized values being < 1e-100)
l = l[np.all(l > 1e-100, axis=1)]

10 loops, best of 3: 18.5 ms per loop
Moritz
  • 5,130
  • 10
  • 40
  • 81
  • That's interesting that pre-allocating with np.empty still isn't faster than the list. – derricw May 27 '15 at 19:00
  • Well, if I preallocate it with the proper size, the speed is similar. But since the OP does not know the size beforehand, I had to use a larger array. – Moritz May 27 '15 at 19:09
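
Picking up on that last comment, one way to preallocate without knowing the final size is a grow-by-doubling buffer. This is only a hedged sketch, not code from the thread; `rows` and `n_cols` are assumed inputs.

import numpy as np

def build_array_growing(rows, n_cols, initial_capacity=1024):
    # Preallocate a buffer and double its size whenever it fills up,
    # then trim it to the number of rows actually written.
    buf = np.empty((initial_capacity, n_cols))
    n = 0
    for row in rows:
        if n == buf.shape[0]:
            buf = np.resize(buf, (2 * buf.shape[0], n_cols))  # reallocates and copies
        buf[n] = row
        n += 1
    return buf[:n].copy()

The doubling amortizes the reallocation cost, so each element is copied only a constant number of times on average.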