
I'm facing two issues in the following snippet using np.where (looking for the indices where A[:, 0] matches values in B):

  1. a NumPy error when n is above a certain value (see the warning below)
  2. it's quite slow

DeprecationWarning: elementwise comparison failed; this will raise an error in the future.

So I'm wondering what I'm missing and/or misunderstanding, how to fix it, and how to speed up the code. This is a basic example I've made to mimic my code, but in fact I'm dealing with arrays with (dozens of) millions of rows.

Thanks for your support

Paul

import numpy as np
import time

n=100_000  # ok with n=10_000, but quite slow
m=2_000_000



#matrix A
# A=np.random.random ((n, 4))
A = np.arange(1, 4*n+1, dtype=np.uint64).reshape((n, 4), order='F')

#Matrix B
B=np.random.randint(1, m+1, size=(m), dtype=np.uint64)
B=np.unique(B) # duplicate values are generally generated, so the real size remains lower than m

# use of np.where
t0=time.time()
ind=np.where(A[:, 0].reshape(-1, 1) == B)
# ind2=np.where(B == A[:, 0].reshape(-1, 1))
t1=time.time()
print(f"duration={t1-t0}")
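For context, the size of the intermediate boolean array that the broadcast comparison materializes can be estimated up front, without allocating it (a quick sketch reusing the n, m and B from the snippet above):

```python
import numpy as np

n = 100_000
m = 2_000_000

# Same B as in the snippet above
B = np.unique(np.random.randint(1, m + 1, size=m, dtype=np.uint64))

# The broadcast comparison produces an (n, len(B)) boolean array;
# each bool takes 1 byte, so the intermediate needs n * len(B) bytes.
est_bytes = n * len(B)
print(f"estimated size of the comparison array: {est_bytes / 1e9:.1f} GB")
```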
Paul18fr
  • Could you describe what the actual end goal of this is? – Dominik Stańczak Aug 12 '22 at 08:17
  • goal ... basically to find the locations of elements of A in B (B is necessarily larger than A) - all data are unsigned integers. From the row locations, I can perform additional tasks. Of course speed is a keyword and I'm using vectorization as often as possible – Paul18fr Aug 12 '22 at 13:50
  • `len(B)` usually turns out to be around 1_264_000 (+- 500). `dtype('bool')` [takes 1 byte](https://stackoverflow.com/q/5602155/14627505). So with your current parameters, `(A[:, 0].reshape(-1, 1) == B)` takes around 31_600_000_000 bytes (31.6 GB). You can check it yourself with `(A[:, 0].reshape(-1, 1) == B).nbytes`. If you want to add another 0 to your `n`, this will become 316 GB. Do you still want to use this, or would you rather describe the more general problem that you are trying to solve? – Vladimir Fokow Aug 12 '22 at 17:42
  • hmm ... ok, thanks for the feedback, especially for the test you provided. I've been confronted with memory issues before, but I wasn't expecting one from this warning. As previously said, I'm focusing on finding the locations of common values of A[:, 0] in B: to extract the complete data rows, or to remove them, for example. I guess I can imagine different tests: 1) looping over one row at a time (with the help of Numba to speed up the process) 2) splitting A into smaller 1D arrays to reduce the number of loops. Other suggestions? Paul – Paul18fr Aug 12 '22 at 20:29
  • What input for `A` do you **really** have? In your code you have a line commented out. – Vladimir Fokow Aug 12 '22 at 20:34
  • Do I understand correctly that we use only the first column of `A`? All its other values are irrelevant? Maybe you should then declare `A` only to have the values that we use - to be more clear – Vladimir Fokow Aug 12 '22 at 20:38
  • Why are you saying “B necessarily greater than A”? B can equal A in some positions – Vladimir Fokow Aug 12 '22 at 20:45
  • Do you need both row and column of the matching numbers? Or only rows? – Vladimir Fokow Aug 12 '22 at 20:59

1 Answer


In your current implementation, A[:, 0] is just

np.arange(1, n + 1, dtype=np.uint64)

And if you are interested only in the row indices where A[:, 0] is in B, you can get them like this (where first_col_of_A is A[:, 0]):

row_indices = np.where(np.isin(first_col_of_A, B))[0]
  • If you then want to select the rows of A with these indices, you don't even have to convert the boolean mask to index locations. You can just select the rows with the boolean mask: A[np.isin(first_col_of_A, B)]

  • There are better ways to select random elements from an array. For example, you could use numpy.random.Generator.choice with replace=False. See also: Numpy: Get random set of rows from 2D array.

  • I feel there is almost certainly a better way to do the whole thing that you are trying to do with these index locations. I recommend you study the Numpy User Guide and the Pandas User Guide to see what cool things are available there.
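A minimal self-contained sketch of this approach, assuming first_col_of_A is A[:, 0] built as in the question:

```python
import numpy as np

n = 100_000
m = 2_000_000

A = np.arange(1, 4 * n + 1, dtype=np.uint64).reshape((n, 4), order="F")
B = np.unique(np.random.randint(1, m + 1, size=m, dtype=np.uint64))

first_col_of_A = A[:, 0]

# Boolean mask: True where the value also appears in B
mask = np.isin(first_col_of_A, B)

# Row indices of the matching rows...
row_indices = np.where(mask)[0]

# ...or select the matching rows directly with the mask
matching_rows = A[mask]
```

Unlike the broadcast comparison, this never builds an (n, len(B)) intermediate, so memory stays proportional to the inputs.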


Honestly, with your current implementation you don't even need the first column of A at all: A[:, 0] is just 1..n, so a value v from B sits at row v - 1. Here:

row_indices = B[B <= n] - 1
row_indices.sort()
print(row_indices)
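As a sanity check (a sketch assuming the same A and B as in the question): since A[:, 0] runs from 1 to n, a value v lands at row v - 1, and the shortcut can be verified against np.isin:

```python
import numpy as np

n = 100_000
m = 2_000_000

A = np.arange(1, 4 * n + 1, dtype=np.uint64).reshape((n, 4), order="F")
B = np.unique(np.random.randint(1, m + 1, size=m, dtype=np.uint64))

# General method: row indices where A[:, 0] appears in B
via_isin = np.where(np.isin(A[:, 0], B))[0]

# Shortcut for this particular A: A[:, 0] is 1..n, so value v sits at
# row v - 1 (B is already sorted, since np.unique returns sorted values)
direct = B[B <= n] - 1

assert np.array_equal(via_isin, direct)
```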
Vladimir Fokow
  • thanks for your contribution and for the links; let me have a look at what you've suggested. => yes, I'm focusing on the row indices of common values - don't pay attention to what I did before the np.where; it was made only to mimic my model (I didn't imagine it was a memory issue; I thought maybe a type conversion issue or something else). Paul – Paul18fr Aug 13 '22 at 08:09
  • It looks like you were trying to solve a simple problem in a very complicated way. Without more accurately described inputs there's not much more I can do. – Vladimir Fokow Aug 13 '22 at 08:13
  • I was not precise enough (my fault, I guess), but I want to find the common values between the first column of A and B, and to get their positions in the array (row numbers); the values do not follow each other; the next challenge will be to speed it up by reducing the amount of memory. Paul – Paul18fr Aug 13 '22 at 08:35
  • @Paul18fr Well, does my answer solve your problem? What do you mean by "values do not follow each other". I would like you to edit your question to show what values actually are there, and remove everything unnecessary – Vladimir Fokow Aug 13 '22 at 08:40
  • the simple 'row_indices = np.where(np.isin(first_col_of_A, B))[0]' runs 1200 times faster and doesn't need as much memory as my snippet ... so yes, absolutely, thanks :-) – Paul18fr Aug 13 '22 at 08:44