
I have a numpy M×N array.

import numpy

numpy.random.seed(23)
A = numpy.round(numpy.random.random((4, 10)), 2)

array([[0.59, 0.77, 0.66, 0.56, 0.18, 0.24, 0.51, 0.4 , 0.48, 0.96],
       [0.9 , 0.51, 0.82, 0.83, 0.23, 0.08, 0.47, 0.88, 0.15, 0.23],
       [0.92, 0.13, 0.92, 0.23, 0.62, 0.95, 0.26, 0.45, 0.97, 0.24],
       [0.2 , 0.69, 0.85, 0.45, 0.1 , 0.62, 0.08, 0.05, 0.35, 0.91]])

Then I have a list of M lists of indexes:

ind = [[1,2], 
       [4,7,8,9], 
       [3,6,7], 
       [4,5,1]]

Each list contains the indexes of the corresponding row to be nulled, i.e. in row #0 the 1st and 2nd elements should be nulled, etc...

And I should obtain:

array([[0.59, 0.  , 0.  , 0.56, 0.18, 0.24, 0.51, 0.4 , 0.48, 0.96],
       [0.9 , 0.51, 0.82, 0.83, 0.  , 0.08, 0.47, 0.  , 0.  , 0.  ],
       [0.92, 0.13, 0.92, 0.  , 0.62, 0.95, 0.  , 0.  , 0.97, 0.24],
       [0.2 , 0.  , 0.85, 0.45, 0.  , 0.  , 0.08, 0.05, 0.35, 0.91]])

The "forehead" (brute-force) solution is:

for i in range(len(ind)):
    A[i, ind[i]] = 0

But, you know, it's too slow. Is it possible to have a "vectorized" solution?

Iliya Fiks
  • Since the column lists vary in length, easy 'vectorizing' isn't possible. I can imagine expanding the row indices with `repeat` to match `np.hstack(ind)` and doing one assignment, but can't promise a speedup. – hpaulj Jul 01 '22 at 21:24
  • @hpaulj I thought about numpy.vectorize, but it does not work natively with 2d arrays )) – Iliya Fiks Jul 01 '22 at 21:36
  • [`np.vectorize`](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) has a *Note* about expecting performance improvements. Can the part that creates *ind* be changed or is it fixed? – Michael Szczesny Jul 01 '22 at 21:43
  • https://stackoverflow.com/questions/20103779/index-2d-numpy-array-by-a-2d-array-of-indices-without-loops might help but without jagged ind lengths – rikyeah Jul 01 '22 at 21:47
  • @MichaelSzczesny yes... it's always different – Iliya Fiks Jul 01 '22 at 21:49
  • That's not what I meant. Can the code be changed that is creating the list? – Michael Szczesny Jul 01 '22 at 21:55
  • @MichaelSzczesny no – Iliya Fiks Jul 01 '22 at 22:00
  • 1
    Converting *ind* into a vectorizable format will be slower than iterating over it once. Zipping *A,ind* is 2x faster in my benchmarks: `for i,j in zip(A, ind): i[j] = 0` – Michael Szczesny Jul 01 '22 at 22:15
  • 1
    @IliyaFiks why not? How are you generating these indices in the first place? – ddejohn Jul 01 '22 at 22:27
  • If you're willing to make some compromise in how these values are actually generated, this task becomes trivial. Otherwise, you're stuck with the "forehead" solution because there's just no other way to use ragged arrays. Also, the "forehead" solution *does* still take advantage of advanced indexing, FWIW. – ddejohn Jul 01 '22 at 22:31

2 Answers


To answer the question in the post,

Is it possible to have "vectorized" solution?

No. It is not possible to take advantage of vectorization because your ind list is ragged. However, depending on how you acquire these index values, changing that process slightly makes the task trivial:

rows = [0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
cols = [1, 2, 4, 7, 8, 9, 3, 6, 7, 4, 5, 1]

A[rows, cols] = 0

If you can produce your indices this way in the first place, then you're golden.

Again, as others have mentioned, this information (the rows and columns) is already present in your indices. Depending on how they were generated in the first place (which I suggest you share as an edit to the body of the question), whether calculated or read from a file, you should be able to generate your indices in the appropriate format from the start. That would remove the need for your "forehead" workaround and for transforming the indices into a numpy-compatible format after the fact.

So, TL;DR -- if you can generate your indices in the format I've shown above in the first place, do so. If not, then there's no point in using anything other than the "forehead" solution.

ddejohn
  • 2
    All this information must be available anyway when the irregular list is created. With `numpy` arrays, this would also use less memory. I disagree that this answer is not useful for future readers. Similar questions are asked frequently. – Michael Szczesny Jul 01 '22 at 22:46
  • "All this information..." -- exactly, hence why I wanted to actually put it in answer form, since OP seems to be missing the forest for the trees here. I'll redact the bit about it not being useful. – ddejohn Jul 01 '22 at 23:01
  • I timed this approach - with the creation of the full `row/col` - and found that it is a bit slower, for this particular combination of shapes. I expect relative speeds to vary. – hpaulj Jul 01 '22 at 23:04
  • The idea is that if OP has control of the process of generating these indices in the first place then the benefit will be greater, presumably. Obviously transforming *what OP currently has* into something numpy likes will not benefit them, *but* if they can simply make their indices "numpy-compatible" in the first place, then that likely will net OP appreciable performance. – ddejohn Jul 01 '22 at 23:08
  • Thank you for the reply. No, unfortunately, I can't do it. But, I think I can try to generate indexes in "mask" format. And then try to use it. – Iliya Fiks Jul 04 '22 at 10:36

Your iterative solution:

In [78]: %%timeit 
    ...: for i in range(len(ind)):
    ...:     A[i, ind[i]] = 0
34.4 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The alternative that I suggested in a comment:

In [79]: d=np.arange(4).repeat([len(i) for i in ind])    
In [80]: d
Out[80]: array([0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3])    
In [81]: A[d,np.hstack(ind)]
Out[81]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
    
In [83]: %%timeit
    ...: d=np.arange(4).repeat([len(i) for i in ind])
    ...: A[d,np.hstack(ind)] = 0
37.1 µs ± 71.1 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

This alternative is a bit slower. However, the relative speeds will vary for real-world cases, depending on the shape of A and the number of indices in ind.

hpaulj
  • Thank you for your solution. In my case the typical size of A is 10000 × 3000. So:

        %time for i, indexes in enumerate(indexes_to_zero): A[i, indexes] = 0
        CPU times: total: 4.09 s  Wall time: 4.09 s

    and:

        %%time
        d = numpy.arange(A.shape[0]).repeat([len(i) for i in indexes_to_zero])
        A[d, numpy.hstack(indexes_to_zero)] = 0
        CPU times: total: 4.61 s  Wall time: 4.6 s

    – Iliya Fiks Jul 04 '22 at 10:33