I have to perform some operations on data using Python; below is one operation that takes too much time (approximately 21 minutes) and I have to perform many such operations on different datasets. Is it normal, or can it be made faster?

flag = np.array([], dtype=np.bool_)

for i in range(len(dset1)):
    flag = np.append(flag, np.any(abs(dset1[i, 0] - dset2[:, 0]) / 1000 <= 500))

Length of dset1 is 72805 and length of dset2 is 1455873.

mkrieger1
  • 1
    What's the full `.shape`, not just the first dimension, of `dset1` and `dset2`? – Dominik Stańczak Jun 09 '22 at 15:10
  • 2
    Does this answer your question? [What is the fastest way to stack numpy arrays in a loop?](https://stackoverflow.com/questions/58083743/what-is-the-fastest-way-to-stack-numpy-arrays-in-a-loop) and [Why use numpy over list based on speed?](https://stackoverflow.com/questions/46860970) and [NumPy append vs Python append](https://stackoverflow.com/questions/29839350) and [Python numpy array of numpy arrays](https://stackoverflow.com/questions/31250129) (and many others) – Jérôme Richard Jun 09 '22 at 15:13
  • The full shape of the arrays is (72805, 2) and (1455873, 2). – The_Learner Jun 09 '22 at 15:51

1 Answer

Never use np.append in this way! Each call copies the whole array into a newly allocated one, so this loop allocates 72805 arrays.

Instead, at the very least do this:

flag = np.array([
    np.any(abs(dset1[i,0]-dset2[:,0])/1000 <= 500) for i in range(len(dset1))
])

This first builds a list iteratively and then converts it to an array in one go.

If dset1 and dset2 are plain NumPy arrays, there is a further optimization available via broadcasting, but the change above should already cut most of your runtime.


The other, optimized solution would be to skip the for loop and just vectorize this:

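# First column of each dataset (the names are kept from the original answer).
dset1row = dset1[:, 0]
dset2row = dset2[:, 0]
# "<=" matches the question's test: abs(diff) / 1000 <= 500 is abs(diff) <= 500 * 1000.
# The broadcast intermediate has shape (len(dset1), len(dset2)), so it must fit in memory.
flag2 = np.any(np.abs(dset1row[:, np.newaxis] - dset2row[np.newaxis, :]) <= (500 * 1000), axis=1)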
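
With the sizes from the question, the full broadcast needs a 72805 x 1455873 intermediate, which is roughly 10^11 elements and will not fit in memory on most machines. A minimal sketch of the same broadcasting idea applied to a few rows of dset1 at a time is below. The function name `any_within_chunked` and the `chunk` parameter are made up for illustration, and the default of 32 is only a guess (a few hundred megabytes per temporary at 8 bytes per element); tune it to your RAM. This bounds memory, but it still does the same number of comparisons, so do not expect a large speed-up from it alone.

```python
import numpy as np

def any_within_chunked(a, b, tol, chunk=32):
    """For each a[i], report whether any b[j] satisfies |a[i] - b[j]| <= tol."""
    a = np.asarray(a)
    b = np.asarray(b)
    flag = np.empty(len(a), dtype=np.bool_)
    for start in range(0, len(a), chunk):
        block = a[start:start + chunk]
        # Temporary of shape (len(block), len(b)) instead of (len(a), len(b)).
        diff = np.abs(block[:, np.newaxis] - b[np.newaxis, :])
        flag[start:start + chunk] = np.any(diff <= tol, axis=1)
    return flag
```

Usage with the arrays from the question would be `flag2 = any_within_chunked(dset1[:, 0], dset2[:, 0], 500 * 1000)`.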
Dominik Stańczak
  • Hi, thank you very much for your suggestion. I tried it, but it still takes almost the same time. Both dset1 and dset2 are NumPy arrays; can you please give some reference on broadcasting? Best regards – The_Learner Jun 09 '22 at 15:36
  • Try the edit :) – Dominik Stańczak Jun 09 '22 at 16:09
  • I believe the edited method is certainly better, but I get a memory error when using it. I think first building a list and then converting it to a NumPy array is the best option in this case. – The_Learner Jun 10 '22 at 08:40
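
Regarding the memory error and the unchanged runtime reported in the comments: both approaches in the answer still compare every value of dset1 against every value of dset2. A different technique, not the one from the answer, is to sort the dset2 column once and use binary search (`np.searchsorted`) to count how many of its values fall inside `[x - tol, x + tol]` for each `x`. Below is a minimal sketch under a few assumptions: the function name `any_within_sorted` is hypothetical, the columns are signed integers or floats (unsigned integers would wrap on `a - tol`), and for floats the result may differ from the original test for values lying exactly on the boundary because of rounding.

```python
import numpy as np

def any_within_sorted(a, b, tol):
    """For each a[i], report whether any b[j] satisfies |a[i] - b[j]| <= tol."""
    a = np.asarray(a)
    b_sorted = np.sort(np.asarray(b))
    # Index of the first element of b_sorted that is >= a[i] - tol.
    lo = np.searchsorted(b_sorted, a - tol, side="left")
    # Index one past the last element of b_sorted that is <= a[i] + tol.
    hi = np.searchsorted(b_sorted, a + tol, side="right")
    # hi - lo is the number of b values inside [a[i] - tol, a[i] + tol].
    return hi > lo
```

With the arrays from the question this would be called as `flag = any_within_sorted(dset1[:, 0], dset2[:, 0], 500 * 1000)`. It needs only a sorted copy of one column plus two index arrays the length of dset1, so it avoids the large intermediate entirely.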