
Ran into a strange behaviour recently with a pandas Series when trying to calculate the mean of a large number of values loaded in memory as "float32". (The data is too big to load as float64, etc.; consider float32 necessary, and the precision loss acceptable.)

There is a difference between the mean values calculated by pandas and numpy, but that part makes sense based on how each one sums, as shown in this question.

However, after a certain point, the pandas mean just breaks down. MVCE with dummy data below. (Update: this only happens with bottleneck installed; more details at the end.)

import numpy as np
import pandas as pd

size_to_test = 100_000_000   # the mean goes bad at this size
# size_to_test = 10_100_100  # acceptable with this

a = np.random.uniform(low=5.0, high=6.0, size=size_to_test)

df = pd.DataFrame(a, dtype="float32")
mean_pandas = df[0].mean()        # pandas' mean (bottleneck-accelerated if installed)
mean_numpy = df[0].values.mean()  # numpy's mean on the underlying array

print(mean_pandas, mean_numpy)
print(f"difference is {mean_pandas - mean_numpy}")

With the code above, when I set the size to 100_000_000, the pandas mean starts giving values that are practically wrong and misleading. (What's worse is that I initially encountered this while calling np.mean() on a Series; I did not expect it to invoke pandas' mean(), but that seems to be the case.) It is impossible to have a mean below 5 for points lying between 5 and 6.

#Output:
1.3421772718429565 5.4999743
difference is -4.1577969789505005
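
If the running sum is kept in a single float32 accumulator (my guess at the mechanism; see the update at the end), the broken value is exactly what you would expect: once the accumulator reaches 2**27, the float32 spacing (ULP) at that magnitude is 16, so adding any value below 8 rounds away to nothing and the sum stalls there. A minimal sketch:

import numpy as np

acc = np.float32(2**27)       # 134217728.0
print(acc + np.float32(5.5))  # still 134217728.0: the ULP here is 16, so 5.5 rounds away
print(np.float32(2**27) / np.float32(100_000_000))  # ~1.3421773, the broken "mean" above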

So, for future safety: use float64 if possible, circumventing the issue, and always call .values when calculating the mean. Noted.

What I would like to understand, however, is why this behaviour happened in the first place.
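
One reproduction that doesn't involve pandas at all: assuming the fast path does a naive left-to-right float32 accumulation (an assumption on my part, not verified against any library's source), np.cumsum exhibits the same failure, since np.add.accumulate runs strictly sequentially, whereas numpy's sum() uses pairwise summation:

import numpy as np

n = 100_000_000
a = np.random.uniform(5.0, 6.0, n).astype(np.float32)

naive = np.cumsum(a, dtype=np.float32)[-1]  # sequential float32 running sum (allocates ~400MB)
pairwise = a.sum()                          # numpy's pairwise summation, still float32
exact = a.sum(dtype=np.float64)             # float64 accumulator, for reference

print(naive / n)     # ~1.34: saturated near 2**27, like the pandas result
print(pairwise / n)  # ~5.49997
print(exact / n)     # ~5.5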

You can ignore the code below:

#For reference: something curious
df[0].sum() / len(df[0])  # for some reason, this way works just fine too
5.4999744

## The code below takes a while to run; you shouldn't run it. It is just for comparison.
# import statistics
# statistics.mean(df[0])
## Output:
# 5.499971017958364

Edit: I ran this on pandas 0.23.4, numpy 1.15.4, Python 3.7.1. Also, I am on Windows, if that affects anything.

UPDATE: Okay, the issue is somehow related to a library called bottleneck. I created a test environment without bottleneck, and the pandas mean ends up with essentially the same result as numpy. I have updated the question to reflect the change. I would still like to understand what that library did to cause such a result, and how it plays into pandas. My bottleneck version is 1.2.1.
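
For anyone who wants to check this on their own setup without uninstalling anything, pandas exposes an option to turn the bottleneck acceleration off (a documented pandas option; I have only verified it against my versions):

import pandas as pd

pd.set_option('compute.use_bottleneck', False)
# with this off, df[0].mean() should match df[0].values.mean()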
