4

I am summing the first 100,000,000 positive integers using the following:

import numpy as np
np.arange(1,100000001).sum()

I get 987459712, which does not match the formula N(N+1)/2 for N = 100000000: the formula gives 5000000050000000.

Before posting, I wrote the following, which returns True:

np.arange(1,65536).sum() == ((65535+1) * 65535)/2

However, the number 65536 seems to be a critical point, as

np.arange(1,65537).sum() == ((65536+1) * 65536)/2

returns False.

From N = 65536 upwards the comparison returns False, whereas values of N below that threshold return True.

Could someone explain either what I've done wrong in calculating the sum, or what is going on with the code?

asked by sunspots (edited by Dimitris Fasarakis Hilliard)
  • This feels like an implementation-specific behavior. What version and implementation of Python are you on? I could correctly compute the expected result using CPython 3.6.3 on a 64-bit machine. – bow Nov 10 '17 at 14:44
  • I'd guess numpy assumes ints given the numbers. But usually int can hold up to 65536. After that you have an overflow and you get inconsistencies. – Ignacio Vergara Kausel Nov 10 '17 at 14:44
  • And to also clarify, I'm on numpy version 1.12.1 – bow Nov 10 '17 at 14:45
  • 1
    I'm also using numpy version 1.12.1. The python version is 3.6.1 – sunspots Nov 10 '17 at 14:50

2 Answers

5

Seems like numpy sometimes has a hard time guessing the correct datatype.

On my system, Win 10 64-bit, Python 3.4.4, numpy 1.13.1:

>>> np.arange(1, 100000001).sum()
987459712
>>> np.arange(1, 100000001).dtype
dtype('int32')

But if we "help" numpy, it gets the correct result:

>>> np.arange(1, 100000001, dtype=np.int64).sum()
5000000050000000

The wrong result is simply due to 32-bit integer overflow.
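
As a rough sanity check (a minimal pure-Python sketch), the reported value is exactly the true sum reduced modulo 2**32, which is the wraparound an int32 accumulator produces here:

# Exact sum via the closed-form formula, using Python's arbitrary-precision ints.
true_sum = 100000000 * (100000000 + 1) // 2   # 5000000050000000

# Keeping only the low 32 bits reproduces the value numpy reported; the result
# is below 2**31, so its signed int32 interpretation is the same positive number.
print(true_sum % 2**32)                       # 987459712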

answered by DeepSpace, edited by Dimitris Fasarakis Hilliard
5

It isn't really that numpy has a hard time guessing things; it's just that the default int type is the same as the C long type:

int_: Default integer type (same as C long; normally either int64 or int32)

On Windows systems, long is 32-bit even on 64-bit builds (see here for more), so that's what's used by default: int32.

As DeepSpace suggested, setting dtype to int64 does the trick. This can be done either in arange or in the sum method.
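
For instance, either of the following should give the exact result even where the default integer is 32-bit (a small sketch):

import numpy as np

# Option 1: build the array with 64-bit integers from the start.
print(np.arange(1, 100000001, dtype=np.int64).sum())   # 5000000050000000

# Option 2: keep the default array dtype but ask sum() to accumulate in 64 bits.
print(np.arange(1, 100000001).sum(dtype=np.int64))     # 5000000050000000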


Additionally, you wrote:

Before posting, I wrote the following, which returns True:

np.arange(1,65536).sum() == ((65535+1) * 65535)/2

However, the number 65536 seems to be a critical point, as

np.arange(1,65537).sum() == ((65536+1) * 65536)/2

returns False.

and this is explained by the fact that the second sum exceeds int32's max value while the first doesn't:

>>> np.arange(1,65536).sum() < np.iinfo(np.int32).max
True
>>> np.arange(1,65537).sum() < np.iinfo(np.int32).max
False

The pure-Python side of the comparison is, of course, correct thanks to Python 3's arbitrary-precision ints.
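
Spelling the two triangular numbers out in plain Python shows how narrowly the threshold sits at the int32 limit (a quick sketch):

# Exact triangular numbers on either side of the observed threshold.
print(65535 * 65536 // 2)   # 2147450880 -> still below 2**31 - 1
print(65536 * 65537 // 2)   # 2147516416 -> above 2**31 - 1 = 2147483647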


This is why many of us weren't able to reproduce the problem: on most Unixes the default int size on 64-bit machines is int64 (since the C long is 64 bits there), so the sum of those ints equalled the expected value.
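
If you want to check which default your own build uses, inspecting numpy's default integer type is enough (a minimal sketch; the exact output depends on the platform):

import numpy as np

# numpy's default integer follows the platform's C long (on these numpy versions).
print(np.dtype(np.int_))       # int64 on most 64-bit Unix builds, int32 on Windows
print(np.iinfo(np.int_).max)   # 9223372036854775807 or 2147483647, respectively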

answered by Dimitris Fasarakis Hilliard
  • 1
    So why does `np.arange(1, 100000001, dtype=np.int32).sum()` give the correct 64-bit result? (At least, it does on my machine: macOS, Python 3.6, NumPy 1.13.3) – Mark Dickinson Nov 10 '17 at 15:15
  • 2
    @MarkDickinson the [`sum`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.sum.html#numpy.sum) method uses the default int size which on [most unix-like systems](https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models) is `int64`. Apparently, windows with `cygwin` should also use `int64` according to the model linked. – Dimitris Fasarakis Hilliard Nov 10 '17 at 15:52