0

I need to add a lot of values (about 100k) to a NumPy array in a loop, and I know these methods:

import numpy as np
import time

#Method 1:
start = time.time()
b = np.array([0.0])
for i in range (1, 100000):
    b = np.append(b, np.array([i]))
end = time.time()
print(end-start)

#Method 2:
start = time.time()
a = np.array([0])
A = np.empty(99999) * np.nan
a = np.concatenate((a, A), axis=0)
for i in range (1, 100000):
    a[i] = i
end = time.time()
print(end-start)


_______________________________
result:
3.2555339336395264
0.018993854522705078

As you can see, method 2 is faster, but the problem is that I must remove the np.nan values from my array afterwards (I don't know how many values I will add, so I create an np.nan array larger than needed). Is there a way to do this?

Qeyzho
  • Will you always be adding only one value at a time? I assume you won't just be adding consecutive integers to your array in the real case? – Matt Pitkin Mar 28 '23 at 08:55
  • Test your alternatives on small examples to ensure the results are right. Usually list append is a good option. Make an array from the list with one call at the end. – hpaulj Mar 28 '23 at 14:33
  • Repeated append/concatenate is slow because it makes a new array each time, resulting in lots of copies. List append is much better because it operates in-place, and has a builtin mechanism for growth. – hpaulj Mar 28 '23 at 15:26
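
The list-append approach suggested in the comments above can be sketched as follows. This is a minimal illustration, assuming the values arrive one at a time; `collect_values` is a hypothetical helper name, not something from the thread.

```python
import numpy as np

def collect_values(n):
    """Append to a Python list in the loop, then convert once at the end."""
    values = []                      # lists grow in place as items are appended
    for i in range(n):
        values.append(float(i))      # stand-in for a value produced in the loop
    return np.array(values)          # single conversion, no NaN padding to strip

b = collect_values(100000)
print(b.shape, b.dtype)
```

Because the list only ever holds the values that were actually added, there is nothing to trim afterwards.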

4 Answers

0

You can use nan_to_num to replace NaN with zero (note that this does not remove the entries, so the array keeps its length):

a = np.nan_to_num(a)
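
A small sketch of the difference between replacing and removing NaN, using a toy array (the values are made up for illustration). np.nan_to_num keeps the length and substitutes 0.0, while a boolean mask from np.isnan drops the NaN entries.

```python
import numpy as np

a = np.array([0.0, 1.0, 2.0, np.nan, np.nan])

replaced = np.nan_to_num(a)      # NaN -> 0.0, length unchanged
removed = a[~np.isnan(a)]        # keep only the non-NaN entries

print(replaced)
print(removed)
```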
Mazhar
0

If it's a constant value that you're adding, you can use a NumPy array plus a constant to add that value to every element of the array. Since it's not 100% clear what your initialisation is for, you can also use arange to set the array up.

i.e.

import numpy as np
import time

start = time.time()
a = np.arange(100000)
a = a+3
end = time.time()
print(end-start)

___________________________
result: 0.0007009506225585938
blackrat
0

To my knowledge, the second approach is optimal when the size cannot be determined beforehand and you cannot predict anything about the values. In such cases, a boolean mask built with np.isnan() can be used to drop the remaining NaN placeholders.

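Moreover, multiplying the empty array by np.nan is unnecessary when every slot gets overwritten, as in this benchmark. Note that np.empty leaves the buffer uninitialised, so if some slots may stay unused, fill them with NaN first (e.g. with np.full(n, np.nan)) before relying on np.isnan() to filter them out. Here's an alternative implementation (Method 3), I hope it helps: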

import numpy as np
import time

#Method 1:
start = time.time()
b = np.array([0.0])
for i in range (1, 100000):
    b = np.append(b, np.array([i]))
end = time.time()
print(end-start)

#Method 2:
start = time.time()
a = np.array([0])
A = np.empty(99999) * np.nan
a = np.concatenate((a, A), axis=0)
for i in range (1, 100000):
    a[i] = i
end = time.time()
print(end-start)

#Method 3:
start = time.time()
a = np.array([0])
A = np.empty(99999)
a = np.concatenate((a, A), axis=0)
for i in range (1, 100000):
    a[i] = i
a_new = a[~np.isnan(a)]
end = time.time()
print(end-start)

OUTPUT:

4.930854797363281
0.020646095275878906
0.018013954162597656

0

It all depends on what your end case is. If you know ahead of time how many inputs you'll have, as in your toy example, then you're fine to allocate a sufficiently large array. As far as "removing np.nan from my array" is concerned: you could create a sliced view of the overallocated array, as long as you know how many valid items it contains.
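
A minimal sketch of the sliced-view idea, assuming you keep a running count of how many slots have been written (`capacity` and `count` are hypothetical names used only for illustration):

```python
import numpy as np

capacity = 10                    # deliberately larger than needed
buf = np.empty(capacity)         # uninitialised; unused slots hold arbitrary values
count = 0                        # number of valid items written so far

for value in (1.5, 2.5, 3.5):    # stand-in for incoming data
    buf[count] = value
    count += 1

valid = buf[:count]              # view of the filled part: no copy, no NaN needed
print(valid)
```

Since the count says where the valid data ends, the unused tail never has to be marked with NaN or filtered out.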

In the general streaming case where data is continually coming in, the optimal strategy is to resize as necessary, increasing the array size by a factor, e.g. 2:

# Method 3
import numpy as np
import time

start = time.time()
a = np.array([0])
A = np.empty(100)  # Inappropriately small initial size
a = np.concatenate((a, A), axis=0)

LEN = 100000
for i in range(1, LEN):
    if i >= len(a):
        # Out of room: double the capacity in place
        a.resize(len(a) * 2)
    a[i] = i
a = a[:LEN]  # truncated view

end = time.time()
print(end-start)

This is more appropriate if you have an input of unknowable length. In that case, though, array.array is a much better option for buffering input, and you can then convert to ndarray for your computations.

# Method 4
import array
import time

import numpy as np

start = time.time()
a = array.array("d")  # Assuming float data
a.append(0)

LEN = 100000
for i in range(1, LEN):
    a.append(i)

# Wrap the buffer as an ndarray without copying
a = np.frombuffer(a, dtype=float)

end = time.time()
print(end-start)
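
One thing to be aware of with this approach: np.frombuffer does not copy, so the resulting ndarray shares memory with the array.array buffer. A small sketch, with made-up values, showing the sharing and how to take an independent copy if that is what you need:

```python
import array
import numpy as np

buf = array.array("d", [0.0, 1.0, 2.0])
view = np.frombuffer(buf, dtype=float)   # shares memory with buf
copy = view.copy()                       # independent ndarray

buf[0] = 42.0                            # change the underlying buffer
print(view[0], copy[0])
```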
motto