4

I am summing the first 100,000,000 positive integers using the following:

import numpy as np
np.arange(1,100000001).sum()

I get 987459712, which does not match the formula N(N+1)/2 for N = 100000000: the formula gives 5000000050000000.

Before posting, I wrote the following, which returns True:

np.arange(1,65536).sum() == ((65535+1) * 65535)/2

However, the number 65536 seems to be a critical point, as

np.arange(1,65537).sum() == ((65536+1) * 65536)/2

returns False.

From N = 65536 upwards the comparison returns False, whereas values of N below that threshold return True.

Could someone explain either what I've done wrong in calculating the sum, or what is going on with the code?

asked by sunspots (edited by Dimitris Fasarakis Hilliard)
  • This feels like an implementation-specific behavior. What version and implementation of Python are you on? I could correctly compute the expected result using CPython 3.6.3 on a 64-bit machine. – bow Nov 10 '17 at 14:44
  • I'd guess numpy assumes ints given the numbers. But usually int can hold up to 65536. After that you have an overflow and you get inconsistencies. – Ignacio Vergara Kausel Nov 10 '17 at 14:44
  • And to also clarify, I'm on numpy version 1.12.1 – bow Nov 10 '17 at 14:45
  • 1
    I'm also using numpy version 1.12.1. The python version is 3.6.1 – sunspots Nov 10 '17 at 14:50

2 Answers

5

Seems like numpy sometimes has a hard time guessing the correct datatype.

On my system, Win 10 64-bit, Python 3.4.4, numpy 1.13.1:

>>> np.arange(1, 100000001).sum()
987459712
>>> np.arange(1, 100000001).dtype
dtype('int32')

But if we "help" numpy, it gets the correct result:

>>> np.arange(1, 100000001, dtype=np.int64).sum()
5000000050000000

The wrong result is simply due to 32-bit integer overflow.
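
As a rough sanity check (a minimal pure-Python sketch), the reported value is exactly the true sum reduced modulo 2**32, which is the wraparound an int32 accumulator produces here:

# Exact sum via the closed-form formula, using Python's arbitrary-precision ints.
true_sum = 100000000 * (100000000 + 1) // 2   # 5000000050000000

# Keeping only the low 32 bits reproduces the value numpy reported; the result
# is below 2**31, so its signed int32 interpretation is the same positive number.
print(true_sum % 2**32)                       # 987459712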

answered by DeepSpace, edited by Dimitris Fasarakis Hilliard
5

It isn't really that numpy has a hard time guessing things; it's just that the default int type is the same as the C long type:

int_: Default integer type (same as C long; normally either int64 or int32)

On Windows systems, long is 32-bit even on 64-bit builds (see here for more), so that's what's used by default: int32.

As DeepSpace suggested, setting dtype to int64 does the trick. This can be done either in arange or in the sum method.
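
For instance, either of the following should give the exact result even where the default integer is 32-bit (a small sketch):

import numpy as np

# Option 1: build the array with 64-bit integers from the start.
print(np.arange(1, 100000001, dtype=np.int64).sum())   # 5000000050000000

# Option 2: keep the default array dtype but ask sum() to accumulate in 64 bits.
print(np.arange(1, 100000001).sum(dtype=np.int64))     # 5000000050000000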


Additionally, you wrote:

Before posting, I wrote the following, which returns True:

np.arange(1,65536).sum() == ((65535+1) * 65535)/2

However, the number 65536 seems to be a critical point, as

np.arange(1,65537).sum() == ((65536+1) * 65536)/2

returns False.

and this is explained by the fact that the second sum exceeds int32's max value while the first doesn't:

>>> np.arange(1,65536).sum() < np.iinfo(np.int32).max
True
>>> np.arange(1,65537).sum() < np.iinfo(np.int32).max
False

The pure-Python side of the comparison is, of course, correct thanks to Python 3's arbitrary-precision ints.
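
Spelling the two triangular numbers out in plain Python shows how narrowly the threshold sits at the int32 limit (a quick sketch):

# Exact triangular numbers on either side of the observed threshold.
print(65535 * 65536 // 2)   # 2147450880 -> still below 2**31 - 1
print(65536 * 65537 // 2)   # 2147516416 -> above 2**31 - 1 = 2147483647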


This is why many of us weren't able to reproduce the problem: on most Unixes the default int size on 64-bit machines is int64 (since the C long is 64 bits there), so the sum of those ints equalled the expected value.
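
If you want to check which default your own build uses, inspecting numpy's default integer type is enough (a minimal sketch; the exact output depends on the platform):

import numpy as np

# numpy's default integer follows the platform's C long (on these numpy versions).
print(np.dtype(np.int_))       # int64 on most 64-bit Unix builds, int32 on Windows
print(np.iinfo(np.int_).max)   # 9223372036854775807 or 2147483647, respectively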

answered by Dimitris Fasarakis Hilliard
  • 1
    So why does `np.arange(1, 100000001, dtype=np.int32).sum()` give the correct 64-bit result? (At least, it does on my machine: macOS, Python 3.6, NumPy 1.13.3) – Mark Dickinson Nov 10 '17 at 15:15
  • 2
    @MarkDickinson the [`sum`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.sum.html#numpy.sum) method uses the default int size which on [most unix-like systems](https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models) is `int64`. Apparently, windows with `cygwin` should also use `int64` according to the model linked. – Dimitris Fasarakis Hilliard Nov 10 '17 at 15:52