
I have a large DataFrame (circa 4e+07 rows).

When summing it, I get two significantly different results depending on whether I do the sum before or after the column selection.
Also, the dtype changes from float32 to float64, even though the totals are all below 2**31.

df[[col1, col2, col3]].sum()
Out[1]:
col1         9.36e+07
col2         1.39e+09
col3         6.37e+08
dtype: float32

df.sum()[[col1, col2, col3]]
Out[2]:
col1         1.21e+08
col2         1.70e+09
col3         7.32e+08
dtype: float64

I am obviously missing something; has anybody had the same issue?

Thanks for your help.

Fafon
    Take a look at your `dtypes`. The first one is `float32` while the second is `float64`. That might have something to do with it. – cs95 Oct 27 '17 at 07:59
  • You are right - and I thought the limit was 2**31 - which might be right for integers, but not floats. Thanks for your help and thanks again to @MarkDickinson for pointing out the quirks of floating point arithmetic. – Fafon Oct 28 '17 at 13:34

2 Answers


To understand what's going on here, you need to understand what Pandas is doing under the hood. I'm going to simplify a bit, since there are lots of bells and whistles and special cases to consider, but roughly it looks like this:

Suppose you've got a Pandas DataFrame object df with various numeric columns (we'll ignore datetime columns, categorical columns, and the like). When you compute df.sum(), Pandas:

  1. Extracts the values of the dataframe into a two-dimensional NumPy array.
  2. Applies the NumPy sum function to that 2d array with axis=0 to compute the column sums.

It's the first step that's important here. The columns of a DataFrame might have different dtypes, but a 2d NumPy array can only have a single dtype. If df has a mixture of float32 and int32 columns (for example), Pandas has to choose a single dtype that's appropriate for both columns simultaneously, and in this case it chooses float64. So when the sum is computed, it's computed on double-precision values, using double-precision arithmetic. This is what's happening in your second example.
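(Pandas has its own promotion machinery with a few extra wrinkles, but for this particular combination NumPy's result_type rule gives the same answer, so it makes a convenient illustration — treat this as a sketch of the promotion, not a description of Pandas internals.)

>>> import numpy as np
>>> np.result_type(np.float32, np.int32)  # float32 can't represent every int32 exactly, so both go to float64
dtype('float64')
>>> np.result_type(np.float32, np.float32)
dtype('float32')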

On the other hand, if you cut down to just the float32 columns in the first place, then Pandas can and will use the float32 dtype for the 2d NumPy array, and so the sum computation is performed in single precision. This is what's happening in your first example.

Here's a simple example showing this in action: we'll set up a DataFrame with 100 million rows and three columns, of dtypes float32, float32 and int32 respectively. All the values are ones:

>>> import numpy as np, pandas as pd
>>> s = np.ones(10**8, dtype=np.float32)
>>> t = np.ones(10**8, dtype=np.int32)
>>> df = pd.DataFrame(dict(A=s, B=s, C=t))
>>> df.head()
     A    B  C
0  1.0  1.0  1
1  1.0  1.0  1
2  1.0  1.0  1
3  1.0  1.0  1
4  1.0  1.0  1
>>> df.dtypes
A    float32
B    float32
C      int32
dtype: object
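
One way to see the dtype that Pandas picks for the intermediate array is to look at the extracted values directly (the extraction itself is an internal detail, so take this purely as an illustration):

>>> df.values.dtype               # mixed float32/int32 columns get promoted
dtype('float64')
>>> df[['A', 'B']].values.dtype   # a float32-only selection stays float32
dtype('float32')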

Now when we compute the sums directly, Pandas first turns everything into float64s. The computation is also done using the float64 type, for all three columns, and we get an accurate answer.

>>> df.sum()
A    100000000.0
B    100000000.0
C    100000000.0
dtype: float64

But if we first cut down our dataframe to just the float32 columns, then float32 arithmetic is used for the sum, and we get very poor answers.

>>> df[['A', 'B']].sum()
A    16777216.0
B    16777216.0
dtype: float32

The inaccuracy is of course due to using a dtype that doesn't have enough precision for the task in question: at some point in the summation, we end up repeatedly adding 1.0 to 16777216.0, and getting 16777216.0 back each time, thanks to the usual floating-point problems. The solution is to explicitly convert to float64 yourself before doing the computation.
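
To make that concrete, here's the rounding step in isolation, followed by one way of doing that explicit conversion on the example frame above:

>>> np.float32(2**24) + np.float32(1.0)      # 16777216 + 1 isn't representable in float32, so it rounds back down
16777216.0
>>> df[['A', 'B']].astype(np.float64).sum()  # convert first, then sum in double precision
A    100000000.0
B    100000000.0
dtype: float64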

However, this isn't quite the end of the surprises that Pandas has in store for us. With the same dataframe as above, let's try just computing the sum for column "A":

>>> df[['A']].sum()
A    100000000.0
dtype: float32

Suddenly we're getting full accuracy again! So what's going on? This has little to do with dtypes: we're still using float32 to do the summation. It's now the second step (the NumPy summation) that's responsible for the difference. What's happening is that NumPy can, and sometimes does, use a more accurate summation algorithm, called pairwise summation, and with float32 dtype and arrays of the size we're using, that extra accuracy can make a hugely significant difference to the final result. However, it only uses that algorithm when summing along the fastest-varying axis of an array; see this NumPy issue for related discussion.

In the case where we compute the sum of both column "A" and column "B", we end up with a values array of shape (100000000, 2). The fastest-varying axis is axis 1, but we're computing the sum along axis 0, so the naive summation algorithm is used and we get poor results. If we only ask for the sum of column "A", however, we get the accurate result, computed using pairwise summation.
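
You can see the same axis dependence directly in NumPy, with Pandas out of the picture. (This is a sketch at the same 100-million-row scale as above, so each array needs a few hundred megabytes; the exact thresholds and axes for which NumPy applies pairwise summation are implementation details that can vary between versions.)

>>> a = np.ones((10**8, 2), dtype=np.float32)            # same shape as the extracted values for ['A', 'B']
>>> a.sum(axis=0)                                        # summing along the slow axis: naive accumulation
array([ 16777216.,  16777216.], dtype=float32)
>>> np.ones((2, 10**8), dtype=np.float32).sum(axis=1)    # summing along the fast axis: pairwise summation
array([  1.00000000e+08,   1.00000000e+08], dtype=float32)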

In sum, when working with DataFrames of this size, you want to be careful to (a) work with double precision rather than single precision whenever possible, and (b) be prepared for differences in output results due to NumPy making different algorithm choices.

Mark Dickinson
  • Mark, thanks for your answer, it could not be clearer. I was suspecting it had to do with the types, but could not figure it out, also the Numpy pairwise summation you explained was muddying the waters further... Again, many thanks for your help and for sharing your wealth of knowledge :-) – Fafon Oct 28 '17 at 11:02
  • Just an additional question: my numbers add up to circa 10e+08. Doing `df[["A", "B"]].astype(numpy.float64)` definitely works, but using `float32` would not, even though (as per @piRSquared's answer) the maximum number representable by this type is circa 10e+38 and the precision 6, orders of magnitude better than the results I get. Why is `float32` not sufficient? – Fafon Oct 28 '17 at 18:04
  • @Fafon: See the linked "usual floating-point problems" question. While the max of the `float32` type is, as you say, around 1e38, it only has sufficient bits for around 7 decimal digits of precision. So e.g. while `10**8` is exactly representable as a `float32`, the next largest representable `float32` value is `100000008`. So if you add `1.0` to `10**8`, the result can't be represented exactly, and gets rounded to the nearest value that _can_ be represented exactly, which is `10**8` again. – Mark Dickinson Oct 30 '17 at 20:49

You can lose precision with np.float32 relative to np.float64

np.finfo(np.float32)

finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)

And

np.finfo(np.float64)

finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
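
In practice this means that, once a running float32 total is large enough, small increments fall below its resolution and are simply lost, whereas float64 still resolves them. A tiny illustration (results are noted in the comments):

np.float32(1.0) + np.float32(1e-8) == np.float32(1.0)  # True: the increment is below float32's resolution

np.float64(1.0) + np.float64(1e-8) == np.float64(1.0)  # False: float64 still resolves it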

A contrived example

df = pd.DataFrame(dict(
    x=[-60499999.315, 60500002.685] * int(2e7),
    y=[-60499999.315, 60500002.685] * int(2e7),
    z=[-60499999.315, 60500002.685] * int(2e7),
)).astype(dict(x=np.float64, y=np.float32, z=np.float32))

print(df.sum()[['y', 'z']], df[['y', 'z']].sum(), sep='\n\n')

y    80000000.0
z    80000000.0
dtype: float64

y    67108864.0
z    67108864.0
dtype: float32
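
If I've traced the rounding correctly, this also explains the exact numbers: the two values are first rounded when stored in the float32 columns (to -60500000.0 and 60500004.0), so each pair contributes exactly +4.0 and the float64 total comes out at 2e7 × 4 = 80000000.0 rather than 2e7 × 3.37. The float32 running total gets stuck at 67108864.0 = 2**26: at that magnitude adjacent float32 values are 8 apart, and the net +4.0 from each pair lands exactly halfway between representable values, so round-to-even sends it back down.

np.float32(-60499999.315), np.float32(60500002.685)  # what the float32 columns actually store

(-60500000.0, 60500004.0)

np.spacing(np.float32(67108864.0))  # gap between adjacent float32 values at 2**26

8.0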
piRSquared
  • Thanks, I will use this `finfo` function; I never figured out how to find the limits of each type. Many thanks for your help. – Fafon Oct 28 '17 at 13:29