
Let me start with the example code:

import numpy
from pandas import DataFrame

a = DataFrame({"nums": [2233, -23160, -43608]})

a.nums = numpy.int64(a.nums)

print(a.nums ** 2)
print((a.nums ** 2).sum())

On my local machine, and other devs' machines, this works as expected and prints out:

0       4986289
1     536385600
2    1901657664
Name: nums, dtype: int64
2443029553

However, on our production server, we get:

0       4986289
1     536385600
2    1901657664
Name: nums, dtype: int64
-1851937743

That is 32-bit integer overflow, despite the column being int64.
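For what it's worth, the bad result is exactly the true sum wrapped into a signed 32-bit integer:

```python
# The three squares printed above sum to more than 2**31 - 1
true_sum = 4986289 + 536385600 + 1901657664  # 2443029553

# Wrap into a signed 32-bit integer (two's complement)
wrapped = (true_sum + 2**31) % 2**32 - 2**31
print(wrapped)  # -1851937743, the value seen on the production server
```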

The production server uses the same versions of Python, NumPy, pandas, etc. It's a 64-bit Windows Server 2012 OS, and everything reports 64-bit (e.g. `python --version`, `sys.maxsize`, `platform.architecture()`).

What could possibly be causing this?

Sean Kramer
  • Why don't you use regular Python integers that are capable of representing arbitrarily large numbers? – ForceBru Apr 20 '17 at 16:55
  • We were depending on numpy's infinity calculation when you div by zero. Regardless, this is a workaround. The real issue here is that our dev and prod environments aren't behaving identically. – Sean Kramer Apr 20 '17 at 17:01
  • @ForceBru: They're slow, bulky, and cause weird breakages if you try to use object arrays full of integer objects. – user2357112 Apr 20 '17 at 17:03
  • Does one of the machines have `bottleneck` installed? – user2357112 Apr 20 '17 at 17:15
  • Might be related http://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine – ayhan Apr 20 '17 at 17:21
  • What is the output of `print((a.nums.values**2).sum(dtype=np.int64))`? – Warren Weckesser Apr 20 '17 at 17:22
  • @WarrenWeckesser The correct value (2443029553)! So why would `sum` default to int32 on the prod server (perhaps like the post ayhan mentioned) but use the actual data type locally? – Sean Kramer Apr 20 '17 at 17:25
  • @user2357112 Yes, the prod server has bottleneck installed. I'll try removing it and see if it works. – Sean Kramer Apr 20 '17 at 17:28
  • @user2357112 That did it! How'd you know there was a bug in bottleneck? – Sean Kramer Apr 20 '17 at 17:30
  • @SeanKramer: I just started digging through the code and wound up in bottleneck. I think bottleneck is mishandling `numpy.int64` on platforms where a C long is 32-bit, and Pandas is getting a check wrong in its attempts to compensate for bottleneck's error. – user2357112 Apr 20 '17 at 17:32
  • It looks like there's a [Github issue](https://github.com/pandas-dev/pandas/issues/15453) about this on the Pandas issue tracker. I'm pretty sure they should have checked `itemsize == 8` instead of `itemsize < 8`; bottleneck should be fine for `itemsize < 8`. – user2357112 Apr 20 '17 at 17:37
  • @user2357112 Not sure about the proper etiquette, but if you want to post that finding as an answer, I'll mark this as solved. – Sean Kramer Apr 20 '17 at 17:38
  • There's also a [Github issue](https://github.com/kwgoodman/bottleneck/issues/163) on the Bottleneck issue tracker. – user2357112 Apr 20 '17 at 17:46
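(Side note: the 32-bit C `long` user2357112 mentions is easy to confirm from Python. 64-bit Windows uses the LLP64 data model, where `long` stays 4 bytes, while 64-bit Linux and macOS use LP64, where it is 8 bytes:)

```python
import ctypes

# 4 on 64-bit Windows (LLP64), 8 on 64-bit Linux/macOS (LP64)
print(ctypes.sizeof(ctypes.c_long))
```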

1 Answer


This is a bug in the bottleneck library, which Pandas uses if it's installed. In some circumstances, bottleneck.nansum incorrectly has 32-bit overflow behavior when called on 64-bit input.

I believe this is due to bottleneck using `PyInt_FromLong` even when `long` is 32-bit. I'm not sure why that even compiles, actually. There's an [issue report](https://github.com/kwgoodman/bottleneck/issues/163) on the bottleneck issue tracker, not yet fixed, as well as an [issue report](https://github.com/pandas-dev/pandas/issues/15453) on the Pandas issue tracker, where they tried to compensate for bottleneck's bug (but I think they disabled bottleneck in the cases where it works, instead of the cases where it doesn't).
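Until it's fixed, the workaround from the comments above is to bypass pandas' bottleneck-backed `nansum` and sum the underlying NumPy array with an explicit 64-bit accumulator (uninstalling bottleneck also works):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"nums": [2233, -23160, -43608]})
a["nums"] = a["nums"].astype(np.int64)

# Summing the raw ndarray with an explicit accumulator dtype
# avoids the bottleneck code path entirely.
total = (a["nums"].values ** 2).sum(dtype=np.int64)
print(total)  # 2443029553 on both dev and prod
```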

user2357112