
I am setting the values of multiple elements in a 2D array, but my data sometimes contains multiple values for a given index.

It seems that the "later" value is always assigned (see examples below), but is this behaviour guaranteed, or is there a chance I will get inconsistent results? How do I know that I can interpret "later" in the way that I would like in a vectorized assignment?

That is, in my first example will a definitely always contain 4, and in the second example would it ever print values[0]?

Very simple example:

import numpy as np

a = np.zeros(1)  # a single-element target array (must exist before the fancy assignment)
indices = np.zeros(5, dtype=int)
a[indices] = np.arange(5)
a  # array([ 4.])

Another example

import numpy as np

grid = np.zeros((1000, 800))

# generate indices and values
xs = np.random.randint(0, grid.shape[0], 100)
ys = np.random.randint(0, grid.shape[1], 100)
values = np.random.rand(100)

# make sure we have a duplicate index
print(values[0], values[5])
xs[0] = xs[5]
ys[0] = ys[5]

grid[xs, ys] = values

print "output value is", grid[xs[0], ys[0]]
# always prints value of values[5]
YXD
  • To understand what's going on in the workings of NumPy arrays, I recommend http://scipy-lectures.github.io/advanced/advanced_numpy/ – tom10 Apr 12 '13 at 15:44
  • Nice question... This is one of those you probably have to wait for @seberg to be around to get a meaningful answer. – Jaime Apr 12 '13 at 16:02
  • I doubt that anything is guaranteed, but some experiments with exotically strided arrays point to a simple left-to-right loop over the array of indices. – Fred Foo Apr 12 '13 at 16:03
  • Good question! The numpy mailing list would really be the best place to ask this. A handful of the core devs occasionally frequent SO (e.g. David Cournapeau, rkern, seberg), but none of them do so regularly. It's going to take someone fairly familiar with the numpy C codebase to confirm that this is what will always happen with the current version, and one of the core devs to say whether or not it's guaranteed. (I think @larsmans is correct, though.) – Joe Kington Apr 12 '13 at 19:36
  • Thanks guys - I'll chase this up next week via other avenues if there's no luck here. – YXD Apr 12 '13 at 22:20

5 Answers


In NumPy 1.9 and later this will in general not be well defined.

The current implementation iterates over all (broadcast) fancy indexes (and the assignment array) at the same time using separate iterators, and these iterators all use C-order. In other words, currently, yes, you can rely on it. If you want to know it more exactly: if you look at mapping.c in NumPy, which handles these things, you will see that it uses PyArray_ITER_NEXT, which is documented to iterate in C-order.
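To make that concrete, here is a minimal pure-Python model of the behaviour (my own sketch, not code from mapping.c): a left-to-right loop over the index array, in which later writes to a duplicated index win.

import numpy as np

def fancy_assign(a, indices, values):
    # Model of the current C-order iteration: later entries overwrite earlier ones.
    for i, v in zip(indices, values):
        a[i] = v

a = np.zeros(1)
b = np.zeros(1)
idx = np.zeros(5, dtype=int)
vals = np.arange(5.0)

a[idx] = vals               # vectorized fancy assignment
fancy_assign(b, idx, vals)  # explicit left-to-right loop
print(a, b)                 # [4.] [4.] -- same result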

For the future, I would paint the picture differently: I think it would be good to iterate over all the indices plus the assignment array together, using the newer iterator. If this is done, the order could be left open for the iterator to decide the fastest way of doing it. If it is left open to the iterator, it is hard to say what would happen, and you could not be certain that your example works (you could probably still rely on the 1-d case, but...).

So, as far as I can tell it works currently, but it is undocumented (as far as I know), so if you actually think that this should be ensured, you would need to lobby for it, and ideally write some tests to make sure it can be guaranteed. Because I, at least, am tempted to say: if it makes things faster, there is no reason to ensure C-order, but of course maybe there is a good reason hidden somewhere...

The real question here is: Why do you want that anyway? ;)

seberg
  • Thanks very much for the answer. I will write up a justification of why I am asking / why this came up in the next couple of days. I don't consider myself knowledgeable enough to say whether there is a genuinely good reason to keep C-order though... – YXD Apr 16 '13 at 10:11
  • I saw [this discussion](http://mail.scipy.org/pipermail/numpy-discussion/2014-February/068810.html) about numpy 1.9 and was wondering what the implications were for the 2D case presented in this question. The motivation comes from a vision problem where I project some 3D data down into discrete pixel coordinates and need to efficiently keep track of the "best" data per-pixel, where due to the projection there may be collisions in image-space. My real code achieves this by first sorting the data according to the cost/error and then assigning as shown in the question. – YXD Mar 17 '14 at 15:41

I know this has been satisfactorily answered, but I wanted to mention that it is documented as being the "last value" (perhaps informally) in the Tentative Numpy Tutorial under Indexing with Arrays of Indices:

However, when the list of indices contains repetitions, the assignment is done several times, leaving behind the last value:

>>> from numpy import arange
>>> a = arange(5)
>>> a[[0,0,2]] = [1,2,3]
>>> a
array([2, 1, 3, 3, 4])

This is reasonable enough, but watch out if you want to use Python's += construct, as it may not do what you expect:

>>> a = arange(5)
>>> a[[0,0,2]] += 1
>>> a
array([1, 1, 3, 3, 4])

Even though 0 occurs twice in the list of indices, the 0th element is only incremented once. This is because Python requires a+=1 to be equivalent to a=a+1.
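To see that equivalence concretely, here is the expansion spelled out (my own sketch of what the tutorial describes):

>>> a = arange(5)
>>> temp = a[[0,0,2]] + 1    # the right-hand side is evaluated once: array([1, 1, 3])
>>> a[[0,0,2]] = temp        # index 0 is written twice with the same value
>>> a
array([1, 1, 3, 3, 4])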

askewchan

I'm not answering your question directly; I'd just like to make the point that even if you can rely on that behavior being consistent, you'd better not.

Consider:

import numpy as np

a = np.zeros(4)
x = np.arange(4)
indices = np.zeros(4, dtype=int)
a[indices] += x

At this point, is it reasonable to assume that a.sum() is a's previous sum + x.sum()?

assert a.sum() == x.sum()
# --> AssertionError

a
# array([ 3.,  0.,  0.,  0.])

In your case, when assigning to an array using duplicate indices, the result is intuitive: assignment to the same index takes place multiple times, thus only the last assignment "sticks" (it overwrites previous ones).

But this is not the case in this example. It is no longer intuitive: if it were, the in-place addition would have taken place multiple times, because addition is accumulative by nature.
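If the accumulating behaviour is what you actually want, NumPy's unbuffered ufunc methods provide it explicitly; a minimal sketch with np.add.at (added in NumPy 1.8):

import numpy as np

a = np.zeros(4)
x = np.arange(4)
indices = np.zeros(4, dtype=int)

# Unbuffered in-place addition: every duplicate index contributes.
np.add.at(a, indices, x)
a                    # array([ 6.,  0.,  0.,  0.])
a.sum() == x.sum()   # True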

So, put another way, you are risking getting caught in this trap:

  • you start working with duplicate indices
  • you see all is well, behavior is exactly as you expect
  • you stop paying attention to the crucial fact that your operations involve duplicate indices. After all, it makes no difference, does it?
  • you start using the same indices in different contexts, e.g. as above
  • deep regrets :)

So, quoting @seberg:

The real question here is: Why do you want that anyway? ;)

shx2
  • That's definitely an interesting case. And yes, your bullet points are exactly why I asked the question. In my original examples it seems more obvious what "should" be happening though. I'll post some background / context in the next couple of days. – YXD Apr 16 '13 at 10:43

I found a way to do this operation with NumPy. It is clearly not optimal, but it is faster than looping with a Python for loop: use numpy.bincount.

import numpy as np

size = 5
a = np.arange(size)
index = [0, 0, 2]
values = [1, 2, 3]
a[index] += values
a  # array([2, 1, 5, 3, 4])

which is not correct, but:

import numpy as np

size = 5
a = np.arange(size)
index = [0, 0, 2]
values = [1, 2, 3]
result = np.bincount(index, values, size)  # accumulates the weights per index
a = a + result  # bincount returns floats, so avoid in-place += on the int array
a  # array([ 3.,  1.,  5.,  3.,  4.])

which is correct!
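The same trick extends to the 2D grid from the question by flattening the (x, y) index pairs first; a sketch using np.ravel_multi_index, assuming per-cell sums are what you want:

import numpy as np

grid = np.zeros((1000, 800))
xs = np.random.randint(0, grid.shape[0], 100)
ys = np.random.randint(0, grid.shape[1], 100)
values = np.random.rand(100)

# Flatten the index pairs, accumulate with bincount, reshape back.
flat = np.ravel_multi_index((xs, ys), grid.shape)
grid += np.bincount(flat, weights=values, minlength=grid.size).reshape(grid.shape)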

Soleares

Times change, and an updated answer should be given: use np.add.at, which applies the addition unbuffered, so duplicate indices accumulate rather than overwrite.

import numpy as np

size = 5
a = np.arange(size)
index = [0, 0, 2]
values = [1, 2, 3]
np.add.at(a, index, values)  # unbuffered: both writes to index 0 take effect
a  # array([3, 1, 5, 3, 4])
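Because np.ufunc.at works for any ufunc, this also covers the 2D case from the question and reductions other than addition. For example, for the "best value per pixel" use case mentioned in the comments above, a sketch with np.maximum.at:

import numpy as np

grid = np.zeros((1000, 800))
xs = np.random.randint(0, grid.shape[0], 100)
ys = np.random.randint(0, grid.shape[1], 100)
values = np.random.rand(100)

# Unbuffered elementwise maximum: with duplicate (x, y) pairs the
# largest value wins, independent of iteration order.
np.maximum.at(grid, (xs, ys), values)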
flumer