
With reference to Why NumPy instead of Python lists?

tom10 said:

Speed: Here's a test on doing a sum over a list and a NumPy array, showing that the sum on the NumPy array is 10x faster (in this test -- mileage may vary).

But my test using the following code:

import numpy as np
import time

N = 100000

#using numpy
start = time.time()
array = np.array([])

for i in range(N):
    array = np.append(array, i)

end = time.time()
print ("Using numpy: ", round(end-start, 2), end="\n")

#using list
start = time.time()
lst = []   # renamed so the built-in "list" isn't shadowed

for i in range(N):
    lst.append(i)

lst = np.array(lst)
end = time.time()
print ("Using list : ", round(end-start, 2), end="\n")

gives the result:

Using numpy: 8.35
Using list : 0.02

Is it true that, when using `append`, a list is better than a numpy array?

Pak Long
  • Yes, `.append` is amortized constant time with `list` objects; with `numpy.ndarray` objects it is linear time. – juanpa.arrivillaga Oct 21 '17 at 07:19
  • Is there any way I can do `.append` to `numpy.ndarray` in the same manner as `list`? – Pak Long Oct 21 '17 at 07:27
  • Numpy arrays are designed for holding potentially multidimensional matrices where appending can't be in general as efficient as in the simple 1D case. Commonly in the code you'd see np.zeros(shape) calls allocating enough elements in advance, where you already know size of your data. If you need to append too often, you should probably just stick to built-in lists. – lomereiter Oct 21 '17 at 08:38
  • 4
    Often people collect data into Vanilla Python lists and only when they need to process it a `numpy.array` will be generated – MaxU - stand with Ukraine Oct 21 '17 at 09:25
  • @coldspeed, Python lists are an array of object references. The append is amortized constant because the list object reserves more space than there are items in the list, so it can add items without requesting extra memory. See https://docs.python.org/2/faq/design.html#how-are-lists-implemented – user2699 Oct 21 '17 at 22:26
  • @cᴏʟᴅsᴘᴇᴇᴅ: what? Python lists are not linked lists, they are dynamic arrays. Append is amortized O(1) because array reallocation doesn't happen at every append, but only once the array capacity is exhausted. This means that most appends are O(1), and once in a while they cost O(n). This "once in a while" is actually exponential in n (as each reallocation allocates roughly 1.125 * needed elements), so the linear cost is killed by the exponential rarity of the reallocation; once we average the cost over infinite appends, each one is O(1). But again, this has nothing to do with linked lists. – Matteo Italia Oct 21 '17 at 23:39
  • @MatteoItalia yeah, that was a slip-up. I am aware of the preallocation and amortised complexity, but I appreciate the explanation nonetheless. – cs95 Oct 21 '17 at 23:40

1 Answer


To answer your question, yes. Appending to an array is an expensive operation, while lists make it relatively cheap (see Internals of Python list, access and resizing runtimes for why). However, that's no reason to abandon numpy. There are other ways to easily add data to a numpy array.
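To make the cost difference concrete, here is a minimal sketch of the behavioral difference: `np.append` returns a freshly allocated copy of the whole array on every call, while `list.append` mutates the list in place.

```python
import numpy as np

arr = np.array([1.0, 2.0])
arr2 = np.append(arr, 3.0)   # allocates a new array and copies everything

print(arr)          # [1. 2.]  -- original is untouched
print(arr2)         # [1. 2. 3.]
print(arr2 is arr)  # False -- a full copy was made

lst = [1.0, 2.0]
ret = lst.append(3.0)        # mutates in place, returns None
print(lst)          # [1.0, 2.0, 3.0]
print(ret)          # None
```

Since every `np.append` copies all existing elements, N appends cost O(N²) total, whereas N `list.append` calls cost O(N) amortized.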

There is a surprising (to me, anyway) number of ways to do this. Skip to the bottom to see benchmarks for each of them.

Probably the most common approach is to simply preallocate the array and index into it:

#using preallocated numpy
start = time.time()
array = np.zeros(N)

for i in range(N):
    array[i] = i

end = time.time()
print ("Using preallocated numpy: ", round(end-start, 5), end="\n")

Of course, you can preallocate the memory for a list too, so let's include that for a benchmark comparison.

#using preallocated list
start = time.time()
res = [None]*N

for i in range(N):
    res[i] = i

res = np.array(res)
end = time.time()
print ("Using preallocated list : ", round(end-start, 5), end="\n")

Depending on your code, it may also be helpful to use numpy's fromiter function, which uses the results of an iterator to initialize the array.

#using numpy fromiter shortcut
start = time.time()

res = np.fromiter(range(N), dtype='float64') # Use same dtype as other tests

end = time.time()
print ("Using fromiter : ", round(end-start, 5), end="\n")

Of course, using a built-in iterator isn't terribly flexible, so let's try a custom iterator as well:

#using custom iterator
start = time.time()
def it(N):
    i = 0
    while i < N:
        yield i
        i += 1

res = np.fromiter(it(N), dtype='float64') # Use same dtype as other tests

end = time.time()
print ("Using custom iterator : ", round(end-start, 5), end="\n")

Those are several very flexible ways of using numpy; the first, using a preallocated array, is the most flexible. Let's see how they compare:

Using numpy:  2.40017
Using list :  0.0164
Using preallocated numpy:  0.01604
Using preallocated list :  0.01322
Using fromiter :  0.00577
Using custom iterator :  0.01458

Right off, you can see that preallocating makes numpy much faster than using lists, although preallocating the list brings both to about the same speed. Using a builtin iterator is extremely fast, although the iterator performance drops back into the range of the preallocated array and list when a custom iterator is used.
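One more knob worth mentioning (not included in the benchmarks above): `np.fromiter` accepts a `count` argument, and passing the final length up front lets it allocate the array once instead of growing it as the iterator is consumed. A sketch in the same style as the tests above; exact timings will vary by machine:

```python
import numpy as np
import time

N = 100000

#using numpy fromiter with a size hint
start = time.time()
# count=N tells fromiter the final length, so it can allocate once
res = np.fromiter(range(N), dtype='float64', count=N)
end = time.time()
print ("Using fromiter with count : ", round(end-start, 5), end="\n")
```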

Converting code directly to numpy often has poor performance, as with append. Finding an approach that uses numpy's methods can almost always give a substantial improvement. In this case, preallocate the array or express the calculation of each element as an iterator to get performance similar to Python lists. Or just use a vanilla Python list, since the performance is about the same.

EDITS: The original answer also included np.fromfunction. This was removed since it didn't fit the use case of adding one element at a time: fromfunction actually initializes the array with a single function call, using numpy's broadcasting. It is about a hundred times faster, so if you can solve your problem using broadcasting, don't bother with these other methods.
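For reference, a minimal sketch of what that broadcasting approach looks like for this toy problem, where every element is just a function of its index (here `np.arange` and `np.fromfunction` both build the whole array in one vectorized call):

```python
import numpy as np

N = 100000

# One vectorized call -- no per-element Python loop at all
res = np.arange(N, dtype='float64')

# fromfunction calls the lambda once with an array of indices,
# relying on broadcasting to produce the whole result
res2 = np.fromfunction(lambda i: i, (N,), dtype='float64')

print(np.array_equal(res, res2))  # True
```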

user2699
  • 2,927
  • 14
  • 31
  • `fromfunction` doesn't actually make separate function calls for each entry; it assumes the function broadcasts, and it calls the function once with arrays as input, expecting the function to produce a single array which will be returned directly. (The docs aren't very clear on this, but at least they're clearer in 1.13; they used to be much worse.) – user2357112 Oct 21 '17 at 23:48
  • @user2357112 Thanks, I did not realize that's how it worked. That's much less useful than I thought. I've updated the post. – user2699 Oct 22 '17 at 13:03