I have a large data set, statistic, with statistic.shape = (1E10,), that I want to effectively bin (sum) into an array of zeros, out = np.zeros(int(1E10)). Each entry in statistic has a corresponding index in idx, which tells me which bin of out it belongs to. The indices are not unique, so I cannot use out[idx] += statistic, since that keeps only one contribution for each repeated index. Therefore I'm using np.add.at(out, idx, statistic). My problem is that for very large arrays, np.add.at() returns the wrong answer.
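To make the difference between the two approaches concrete, here is a minimal sketch on a tiny array. The four-element values and the names buffered and unbuffered are made up for illustration and are not part of my real data.

```python
import numpy as np

# Toy data (hypothetical values): index 1 appears twice.
statistic = np.array([1.0, 2.0, 3.0, 4.0])
idx = np.array([0, 1, 1, 3])

# Fancy-index in-place add: the right-hand side is computed once and then
# assigned, so a repeated index keeps only one of its contributions.
buffered = np.zeros(4)
buffered[idx] += statistic

# np.add.at is unbuffered: every occurrence of an index is accumulated.
unbuffered = np.zeros(4)
np.add.at(unbuffered, idx, statistic)

print(buffered)    # bin 1 is missing one of its two contributions
print(unbuffered)  # [1. 5. 0. 4.]
```

At this size np.add.at() gives the sums I expect; the problem only appears for very large N.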
Below is an example script that shows this behaviour. The function check_add() should return 1.
import numpy as np

def check_add(N):
    N = int(N)
    out = np.zeros(N)
    # Add 1 to every bin exactly once; each bin should end up equal to 1.
    np.add.at(out, np.arange(N), np.ones(N))
    # Fraction of bins that were incremented (expected: 1.0).
    return np.sum(out)/N

n_arr = [1E3, 1E5, 1E8, 1E10]
for n in n_arr:
    print('N = {} (log(N) = {}); output ratio is {}'.format(n, np.log10(n), check_add(n)))
This example returns for me:
N = 1000.0 (log(N) = 3.0); output ratio is 1.0
N = 100000.0 (log(N) = 5.0); output ratio is 1.0
N = 100000000.0 (log(N) = 8.0); output ratio is 1.0
N = 10000000000.0 (log(N) = 10.0); output ratio is 0.1410065408
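One observation about the last line, offered as a hint rather than a diagnosis: the ratio 0.1410065408 is exactly (N mod 2^32)/N for N = 1E10. That would be consistent with some 32-bit counter wrapping around, but that is only an assumption on my part and I have not confirmed it. The arithmetic can be checked with plain Python integers (the name wrapped is just for illustration):

```python
N = 10**10

# Remainder of N after wrapping around a 32-bit range (assumed, not confirmed, cause).
wrapped = N % 2**32

print(wrapped)      # 1410065408
print(wrapped / N)  # 0.1410065408
```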
Can someone explain to me why the function fails for N=1E10?