
I have a large data set, statistic, with statistic.shape = (1E10,), that I want to efficiently bin (sum) into an array of zeros, out = np.zeros(1E10). Each entry in statistic has a corresponding index, idx, which tells me in which out bin it belongs. The indices are not unique, so I cannot use out[idx] += statistic, since repeated indices only get applied once instead of being accumulated. Therefore I'm using np.add.at(out, idx, statistic). My problem is that for very large arrays, np.add.at() returns the wrong answer.
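
For example, with repeated indices only one of the additions sticks, while np.add.at accumulates all of them:

import numpy as np

out = np.zeros(3)
idx = np.array([0, 0, 1])
vals = np.array([1.0, 2.0, 3.0])

out[idx] += vals            # buffered: only one addition per repeated index survives
print(out)                  # [2. 3. 0.]

out2 = np.zeros(3)
np.add.at(out2, idx, vals)  # unbuffered: both additions to index 0 accumulate
print(out2)                 # [3. 3. 0.]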

Below is an example script that shows this behaviour. The function check_add() should return 1.

import numpy as np

def check_add(N):
    N = int(N)
    out = np.zeros(N)
    np.add.at(out, np.arange(N), np.ones(N))
    return np.sum(out)/N

n_arr = [1E3, 1E5, 1E8, 1E10]
for n in n_arr:
    print('N = {} (log(N) = {}); output ratio is {}'.format(n, np.log10(n), check_add(n)))

This example returns for me:

N = 1000.0 (log(N) = 3.0); output ratio is 1.0
N = 100000.0 (log(N) = 5.0); output ratio is 1.0
N = 100000000.0 (log(N) = 8.0); output ratio is 1.0
N = 10000000000.0 (log(N) = 10.0); output ratio is 0.1410065408

Can someone explain to me why the function fails for N=1E10?

4 Answers


This is an old bug, NumPy issue 13286. ufunc.at was using a too-small variable for the loop counter. It got fixed a while ago, so update your NumPy. (The fix is present in 1.16.3 and up.)
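
If updating is not an option right away, one workaround (assuming the indices idx from the question are non-negative integers) is to do the scatter-add with np.bincount, which does not go through ufunc.at:

import numpy as np

# same result as np.add.at(out, idx, statistic) for non-negative integer idx
out = np.bincount(idx, weights=statistic, minlength=N)

np.bincount sums the weights that land in each index, which is exactly this scatter-add, and it is typically much faster than np.add.at.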

user2357112

You're overflowing int32:

int(1E10) % (np.iinfo(np.int32).max - np.iinfo(np.int32).min + 1)  # + 1 for 0
Out[]: 1410065408

There's your weird number (googling that number is actually what led me here, which is how I figured this out).

Now, what's happening in your function is a bit more weird. According to the documentation of ufunc.at, it should just be accumulate-adding the 1 values at the indices below np.iinfo(np.int32).max and at the negative indices above np.iinfo(np.int32).min - but it seems to be 1) working backwards and 2) stopping when it gets to the last overflow. Without digging into the C code I couldn't tell you why, but it's probably a good thing it does: if it had done things that way, your function would have failed silently and with the "correct" mean, while corrupting your results (putting 2s or 3s in those low indices and 0s in the middle).
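
To see why that silent failure mode would be nasty, here is a small-scale sketch of what wrapped indices would do, using a modulus of 4 as a stand-in for the int32 wraparound:

import numpy as np

N = 10
out = np.zeros(N)
idx = np.arange(N) % 4            # pretend the indices wrap after 4, like an int32 counter after 2**32
np.add.at(out, idx, np.ones(N))
print(out)                        # [3. 3. 2. 2. 0. 0. 0. 0. 0. 0.] - pile-up at the start, zeros beyond the wrap
print(np.sum(out) / N)            # 1.0 - the ratio alone would not reveal the corruption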

Daniel F
  • It *should* just work - int32 shouldn't be involved at all. Something in the C code may be using a too-small data type somewhere. – user2357112 Nov 12 '20 at 12:59
  • [Found it](https://github.com/numpy/numpy/commit/6dac2d6a4a061a39cc41f93641c0c29085cd10a6) - there was an `int` that should have been an `npy_intp` in the `ufunc.at` source code. – user2357112 Nov 12 '20 at 13:06
  • Figured as much, but I am not at all qualified to dig through `c` code. – Daniel F Nov 12 '20 at 13:08

It is most likely due to integer precision indeed. If you play around with the NumPy data type (e.g. constrain it to an unsigned value between 0 and 255 by setting uint8), you will see that the ratios start declining already for the second array. I do not have enough memory to test it, but setting all dtypes to uint64 as below should help:

def check_add(N):
    N = int(N)
    out = np.zeros(N, dtype='uint64')
    np.add.at(out, np.arange(N, dtype='uint64'), 1)
    return np.sum(out)/N

To understand the behavior, I recommend setting dtype='uint8' and checking the behavior for smaller N. What happens is that np.arange creates ascending integers for the vector elements until it reaches the integer limit; it then starts again at 0 and counts up again, so at the beginning (smaller N) you get the correct sum, although your out vector contains a lot of elements > 1 in the positions 0:limit and a lot of elements = 0 beyond the limit.

If, however, you choose N large enough, the elements of your out vector themselves start exceeding the integer limit and wrap back to 0. As soon as that happens, your sum is vastly off. To double-check: the uint8 limit is 255 (256 integers) and 256^2 = 65536, so with dtype='uint8', check_add(65536) will return 0.

import numpy as np

def check_add(N):
    N = int(N)
    out = np.zeros(N, dtype='uint8')
    np.add.at(out, np.arange(N, dtype='uint8'), 1)
    return np.sum(out)/N

n_arr = [1E1, 1E3, 1E5, 65536, 1E7]
for n in n_arr:
    print('N = {} (log(N) = {}); output ratio is {}'.format(n, np.log10(n), check_add(n)))
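
To see the accumulator wraparound in isolation: 65536/256 = 256 increments land in each of the 256 reachable bins, and 256 wraps back to 0 in uint8. A minimal check:

import numpy as np

out = np.zeros(1, dtype='uint8')
np.add.at(out, np.zeros(256, dtype=np.intp), 1)  # 256 unbuffered increments of a single uint8 counter
print(out)                                       # [0] - the counter wrapped around exactly once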

Also note that you don't need the np.ones vector; you can simply replace it with 1 if all you care about is uniformly incrementing everything by 1.

seulberg1
  • But the original `out` and `ones` dtypes were already float64, which isn't subject to integer overflow and has enough precision to handle all operations involved. – user2357112 Nov 12 '20 at 12:44
  • The original arange dtype would have been int64, which is also plenty. – user2357112 Nov 12 '20 at 12:45
  • Afaik, the default datatype for np.arange is int32, which is not enough to handle 1E10. I don't believe this is a bug, but mathematical behavior. – seulberg1 Nov 12 '20 at 13:15
  • The default dtype for numpy.arange depends on the arguments and the OS, but regardless of OS, `numpy.arange(int(1e10))` will use int64 dtype. – user2357112 Nov 12 '20 at 13:34

Guessing, as I couldn't run it, but could it be a problem that you are exceeding the maximum integer value in Python for the last option? I.e. it exceeds 2147483647. Use the long integer type instead, as in the reference below.

Referring to: https://docs.python.org/2.0/ref/integers.html

Hope this helps. Please let me know if it does work.

  • That is severely out of date, NumPy math works differently from regular Python ints anyway, and `out` has float64 dtype, not integer dtype. It's likely that some overflow effect is involved somewhere, but Python 2 `long` isn't the solution. – user2357112 Nov 12 '20 at 12:27
  • Good to know. Would be good to know what the issue is. Never came across this before. – mani chooki Nov 12 '20 at 12:40