
This post helped me achieve what I wanted, but the implementation takes too long for some of the large datasets I work on. I have two NumPy arrays (fairly large):

p[:24]=array([[ 0.18264738, -0.00326727,  0.01799096],
   [ 0.18198644, -0.00051316,  0.01800063],
   [ 0.18999948,  0.        ,  0.0226188 ],
   [ 0.18215604,  0.00157497,  0.01799999],
   [ 0.18286349,  0.0036474 ,  0.01799824],
   [ 0.18999948,  0.        ,  0.0226188 ],
   [ 0.18399446,  0.00528562,  0.01799998],
   [ 0.18573835,  0.0068323 ,  0.01799908],
   [ 0.18999948,  0.        ,  0.0226188 ],
   [ 0.18573835,  0.0068323 ,  0.01799908],
   [ 0.18744153,  0.00758001,  0.018     ],
   [ 0.18999948,  0.        ,  0.0226188 ],
   [ 0.18744153,  0.00758001,  0.018     ],
   [ 0.18956973,  0.00801727,  0.01800126],
   [ 0.18999948,  0.        ,  0.0226188 ],
   [ 0.19157426,  0.0078435 ,  0.018     ],
   [ 0.19366005,  0.00714792,  0.01800038],
   [ 0.18999948,  0.        ,  0.0226188 ],
   [ 0.18999948,  0.        ,  0.0226188 ],
   [ 0.19584496,  0.0055142 ,  0.01799665],
   [ 0.19701494,  0.00384344,  0.01800058],
   [ 0.19366005,  0.00714792,  0.01800038],
   [ 0.19584496,  0.0055142 ,  0.01799665],
   [ 0.18999948,  0.        ,  0.0226188 ]])

v[:24]=array([[ 0.18264738, -0.00326727,  0.01799096],
   [ 0.18198644, -0.00051316,  0.01800063],
   [ 0.18999948,  0.        ,  0.0226188 ],
   [ 0.18215604,  0.00157497,  0.01799999],
   [ 0.18286349,  0.0036474 ,  0.01799824],
   [ 0.18399446,  0.00528562,  0.01799998],
   [ 0.18573835,  0.0068323 ,  0.01799908],
   [ 0.18744153,  0.00758001,  0.018     ],
   [ 0.18956973,  0.00801727,  0.01800126],
   [ 0.19157426,  0.0078435 ,  0.018     ],
   [ 0.19366005,  0.00714792,  0.01800038],
   [ 0.19584496,  0.0055142 ,  0.01799665],
   [ 0.19701494,  0.00384344,  0.01800058],
   [ 0.19775054,  0.0019907 ,  0.01800372],
   [ 0.19800517, -0.00065405,  0.01800135],
   [ 0.19731225, -0.00330035,  0.01799999],
   [ 0.19596213, -0.00537427,  0.01800001],
   [ 0.18937038, -0.00797523,  0.018     ],
   [ 0.18739267, -0.00759293,  0.01799974],
   [ 0.18565072, -0.00671446,  0.018     ],
   [ 0.18411626, -0.00545196,  0.01800367],
   [ 0.19136006, -0.00791202,  0.01799961],
   [ 0.1938769 , -0.00702934,  0.01799973],
   [ 0.1314003 , -0.06724723,  0.0645    ]])

The v array is generated from the p array using:

p_uniques, p_indices, p_inverse, p_counts = np.unique(
                                              p, return_index=True, 
                                              return_inverse=True, 
                                              return_counts=True, 
                                              axis=0)

v = p[np.sort(p_indices, axis=None)]
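On a small toy array, this construction of v can be sketched as follows (the `q` array below is made up for illustration, not the asker's data):

```python
import numpy as np

# Hypothetical toy array with duplicate rows, for illustration only
q = np.array([[1.0, 2.0],
              [0.5, 0.5],
              [1.0, 2.0],
              [0.0, 1.0]])

# np.unique sorts the unique rows lexicographically; return_index gives
# the position of each unique row's first occurrence in q
q_uniques, q_indices = np.unique(q, return_index=True, axis=0)

# Sorting the first-occurrence indices restores the original order,
# so w contains q's unique rows in order of first appearance
w = q[np.sort(q_indices, axis=None)]
print(w)  # rows [1,2], [0.5,0.5], [0,1]
```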

Now, the goal is to generate an array containing the indices of the elements of the v array in the p array, including duplicate occurrences. The desired output would therefore be:

indices[:24]=array([ 0,  1,  2,  3,  4,  2,  5,  6,  2,  6,  7,  2,  
                     7,  8,  2,  9, 10, 2,  2, 11, 12, 10, 11,  2])

I just posted the first 24 indices from the indices array to save space.

I tried various approaches using np.where, np.isin, and others, but I could not achieve the desired result with better performance than the solution shared in the linked post.

I'd greatly appreciate your help.


1 Answer


The key insight here is that v is a permutation of p_uniques, and that np.argsort(p_indices) provides this permutation. Inverting this permutation gives us the mapping we have to apply to p_inverse to get the desired result.

To invert the permutation, we use the code from How to invert a permutation array in numpy

import numpy as np

# p_indices: len(v) entries in range(0, len(p)). Maps v indices to p indices.
# p_inverse: len(p) entries in range(0, len(v)). Maps p indices to p_uniques indices.
p_uniques, p_indices, p_inverse = np.unique(
      p, return_index=True, return_inverse=True, axis=0)

# len(v) entries in range(0, len(v)). Maps v indices to p_uniques indices.
sort_permut = np.argsort(p_indices)
v = p_uniques[sort_permut]

# len(v) entries in range(0, len(v)). Maps p_uniques indices to v indices.
inv_sort = np.empty_like(sort_permut)
inv_sort[sort_permut] = np.arange(len(inv_sort))

# len(p), range(0, len(v)). Maps p indices to v indices
indices = inv_sort[p_inverse]
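Put together as a runnable sketch, using a small made-up array (not the asker's data) so the round trip `v[indices] == p` can be verified:

```python
import numpy as np

# Small made-up array with duplicate rows, for illustration only
p = np.array([[1.0, 2.0],
              [0.5, 0.5],
              [1.0, 2.0],
              [0.0, 1.0],
              [0.5, 0.5]])

p_uniques, p_indices, p_inverse = np.unique(
    p, return_index=True, return_inverse=True, axis=0)

# Permutation that reorders the lexicographically sorted unique rows
# into first-appearance order, i.e. into v
sort_permut = np.argsort(p_indices)
v = p_uniques[sort_permut]

# Invert the permutation: inv_sort[sort_permut[i]] == i
inv_sort = np.empty_like(sort_permut)
inv_sort[sort_permut] = np.arange(len(inv_sort))

# Compose with p_inverse: for each row of p, its index in v
indices = inv_sort[p_inverse]
print(indices)  # [0 1 0 2 1]

# Every row of p is recovered from v via indices
assert np.array_equal(v[indices], p)
```

The whole pipeline is a handful of O(n log n) vectorized operations, which explains the large speedup over row-by-row lookups.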
Homer512
  • Thanks for the explanation. Unfortunately, it does not work. I think I should have explained my problem better. v array is generated from p array by using: `p_uniques, p_indices = np.unique( p, return_index=True, axis=0 ) v = p[np.sort(p_indices, axis=None)] ` Therefore, v is nothing but the unique values from the p array. Now, I want to generate an indices array that tells all the occurrences of elements of v in the p array. – Ravi Sep 23 '22 at 06:31
  • @Ravi then I suggest you edit your question because I'm not answering a question that isn't asked ;-) But I can already tell you that it is literally just the `return_inverse=True` option, followed by some index mapping, maybe with `argsort`, to keep in sync with the sorting – Homer512 Sep 23 '22 at 06:44
  • I edited my question and added more clarification. `return_inverse=True` generates an array of indices to reconstruct the original array, in my case the p array. I tried `argsort` but it generates indices of the v array and does not include duplicate occurrences. For more clarification, you can visit this [post](https://stackoverflow.com/questions/64930665/find-indices-of-rows-of-numpy-2d-array-in-another-2d-array) but the solution mentioned takes a longer time and I want to achieve a faster solution. – Ravi Sep 23 '22 at 08:31
  • @Ravi this should do it – Homer512 Sep 24 '22 at 10:18
  • you're awesome, man!!! It works like charm and significantly brings down the time consumed in computation. To give you the context, for some datasets the time has come down from 7 secs to 0.63 msec. – Ravi Sep 25 '22 at 07:07