
I seem to have found a pitfall with using .sum() on numpy arrays, but I'm unable to find an explanation. Essentially, if I try to sum a large array I start getting nonsensical answers, and it happens silently; I can't make sense of the output well enough to Google the cause.

For example, this works exactly as expected:

import numpy as np

a = sum(xrange(2000))
print('a is {}'.format(a))

b = np.arange(2000).sum()
print('b is {}'.format(b))

Giving the same output for both:

a is 1999000
b is 1999000

However, this does not work:

c = sum(xrange(200000)) 
print('c is {}'.format(c))

d = np.arange(200000).sum()
print('d is {}'.format(d))

Giving the following output:

c is 19999900000
d is -1474936480

And on an even larger array, it's possible to get back a positive result. This is more insidious because I might not identify that something unusual is happening at all. For example, this:

e = sum(xrange(100000000))
print('e is {}'.format(e))

f = np.arange(100000000).sum()
print('f is {}'.format(f))

Gives this:

e is 4999999950000000
f is 887459712

I guessed that this was to do with data types, and indeed even using the Python float seems to fix the problem:

e = sum(xrange(100000000))
print('e is {}'.format(e))

f = np.arange(100000000, dtype=float).sum()
print('f is {}'.format(f))

Giving:

e is 4999999950000000
f is 4.99999995e+15

I have no background in Comp. Sci. and found myself stuck (perhaps this is a dupe). Things I've tried:

  1. numpy arrays have a fixed size. Nope; this seems to show I should hit a MemoryError first.
  2. I might somehow have a 32-bit installation (probably not relevant); nope, I followed this and confirmed I have 64-bit.
  3. Other examples of weird sum behaviour; nope (?) I found this but I can't see how it applies.

Can someone please explain briefly what I'm missing and tell me what I need to read up on? Also, other than remembering to define a dtype each time, is there a way to stop this happening or give a warning?

Possibly relevant:

  • Windows 7
  • numpy 1.11.3
  • Running out of Enthought Canopy on Python 2.7.9

roganjosh
  • probably because `numpy` integers rely on C-type integers, whereas python has unlimited integer range. floats are ... floats. They can be very high. – Jean-François Fabre Jan 17 '17 at 19:55
  • Check `np.arange(5).dtype`. It's probably using 32-bit integers instead of 64-bit. Also, make sure you're performing all these checks on the same Python installation. – user2357112 Jan 17 '17 at 19:55
  • Looks like some kind of overflow... the sign of the integer seems to be overwritten as well, which might be the reason you get negative results sometimes. – Nicolas Heimann Jan 17 '17 at 19:57
  • @user2357112 indeed it is printing `dtype('int32')` – roganjosh Jan 17 '17 at 20:00

4 Answers


On Windows (even on a 64-bit system), the default integer type NumPy uses when converting from Python ints is 32-bit. On Linux and Mac it is 64-bit.
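A quick way to confirm which default you have (on 64-bit Windows this should print int32; on Linux or Mac, int64):

import numpy as np

print(np.arange(5).dtype)      # default integer dtype for arange
print(np.array([0]).dtype)     # same default when converting Python ints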

Specify a 64-bit integer and it will work:

d = np.arange(200000, dtype=np.int64).sum()
print('d is {}'.format(d))

Output:

c is 19999900000
d is 19999900000

While not the most elegant solution, you can monkey-patch np.arange using functools.partial:

from functools import partial

np.arange = partial(np.arange, dtype=np.int64)

From now on, np.arange uses 64-bit integers by default.
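A quick check that the patch is in effect (expected output assuming 64-bit integers are now being used):

d = np.arange(200000).sum()
print('d is {}'.format(d))     # d is 19999900000
print(np.arange(5).dtype)      # int64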

Mike Müller
  • This makes sense to me but one of my key questions was whether there was a way to flag this. It's far too easy for this to fall under my radar when debugging; is there a way to identify that I missed this? – roganjosh Jan 17 '17 at 20:03
  • Added a possible solution. – Mike Müller Jan 17 '17 at 20:36
  • Also see https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine; the default is to use the C long int type, which Windows defines as a 32-bit integer, even on a 64-bit CPU. – Martijn Pieters May 02 '20 at 00:17

This is clearly numpy's integer type overflowing 32 bits. Normally you can configure numpy to raise an error in such situations using np.seterr:

>>> import numpy as np
>>> np.seterr(over='raise')
{'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}
>>> np.int8(127) + np.int8(2)
FloatingPointError: overflow encountered in byte_scalars

However, sum is explicitly documented with the behaviour "No error is raised on overflow", so you might be out of luck here. Using numpy often means trading away safety checks like this for performance!
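For example, re-running the question's case on Windows (where the default integer is 32-bit) with the over='raise' setting from above still active should still wrap around silently:

>>> np.arange(200000, dtype=np.int32).sum()  # no error, despite over='raise'
-1474936480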

You can however manually specify the dtype for the accumulator, like this:

>>> a = np.ones(129)
>>> a.sum(dtype=np.int8)  # will overflow
-127
>>> a.sum(dtype=np.int64)  # no overflow
129

Watch ticket #593, because this is an open issue and it might be fixed by numpy devs sometime.

wim
  • Thanks a lot for including the ticket. I suspected it was an overflow but the lack of any warning made me question what was really happening. That, combined with not knowing that Windows automatically converts int to 32-bit left me confused. – roganjosh Jan 19 '17 at 09:54

I'm not a numpy expert, but can reproduce your arange(200000) result in pure Python:

>>> s = 0
>>> for i in range(200000):
...     s += i
...     s &= 0xffffffff
>>> s
2820030816
>>> s.bit_length()
32
>>> s - 2**32  # adjust because the sign bit is set
-1474936480

In other words, the result you're seeing is what I expect if numpy is doing its arithmetic on signed 2's-complement 32-bit integers.
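The masking in the loop can be packaged as a small helper (the name to_int32 is mine) that wraps any Python int the way a signed 32-bit machine integer would:

>>> def to_int32(n):
...     n &= 0xffffffff                  # keep only the low 32 bits
...     return n - 2**32 if n >= 2**31 else n
...
>>> to_int32(sum(xrange(200000)))        # the true sum is 19999900000
-1474936480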

Since I'm not a numpy expert, I can't suggest a good approach to never getting surprised (I would have left this as a comment, but I couldn't show nicely formatted code then).

Tim Peters
  • Start by searching for "two's complement". With various fixed bit widths (8, 16, 32, 64), that's the scheme virtually all computer hardware uses to represent signed integers. In a low-level language like `C` (which CPython and numpy are coded in), native hardware types like that are the ones you work with. It takes a lot of code under the covers to implement Python's illusion of infinitely-wide two's complement integers. – Tim Peters Jan 18 '17 at 04:14
  • Many thanks, I'm reading into it now. That "lot of code under the covers" has shielded me from a lot, thanks :) In this case, Wim has linked to an open ticket with `numpy` in his answer; it's been recognised that there is no overflow error in this particular case (which is what threw me) but it has been open for many years now :( – roganjosh Jan 19 '17 at 10:05

Numpy's default integer type is the same as the C long type, and this isn't guaranteed to be 64 bits on a 64-bit platform. In fact, on Windows, long is always 32 bits.
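You can see the correspondence directly: np.int_ is numpy's default integer type, and 'l' is numpy's character code for C long, so both should report the same width (int32 on Windows, int64 on Linux/Mac):

import numpy as np

print(np.dtype(np.int_))    # numpy's default integer type
print(np.dtype('l'))        # 'l' = C long; same result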

As a result, the numpy sum overflows and wraps back around.

Unfortunately, as far as I know, there is no way to change the default dtype. You'll have to specify it as np.int64 every time.

You could try to create your own arange:

def arange(*args, **kw):
    kw.setdefault('dtype', np.int64)  # default to 64-bit unless the caller overrides it
    return np.arange(*args, **kw)

and then use that version instead of numpy's.
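For example (assuming the wrapper above has replaced direct calls to np.arange; expected output based on the 64-bit results elsewhere on this page):

a = arange(200000)            # the wrapper defined above
print(a.dtype)                # int64
print(a.sum())                # 19999900000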

EDIT: If you want to flag this, you could just put something like this at the top of your code:

assert np.array(0).dtype.name != 'int32', 'This needs to be run with 64-bit integers!'
kirbyfan64sos