
Ran into a strange behaviour recently with a pandas Series when trying to calculate the mean of a large number of values loaded in memory as "float32". (The data is too big to load as float64, etc.; consider float32 necessary, and the precision loss acceptable.)

There is a difference between the mean values calculated by pandas and numpy, but that part makes sense based on how each one sums, as shown in this question.

However, after a certain point, the pandas mean just breaks down. MVCE with dummy data below. (Update: this only happens with bottleneck installed; more details at the end.)

import numpy as np
import pandas as pd

size_to_test = 100_000_000   # the mean goes bad at this size
# size_to_test = 10_100_100  # acceptable with this

a = np.random.uniform(low=5.0, high=6.0, size=size_to_test)

df = pd.DataFrame(a, dtype="float32")
mean_pandas = df[0].mean()        # pandas' mean (bottleneck-accelerated if installed)
mean_numpy = df[0].values.mean()  # numpy's mean on the underlying array

print(mean_pandas, mean_numpy)
print(f"difference is {mean_pandas - mean_numpy}")

With the code above, when I set the size to 100_000_000, the pandas mean starts giving values that are practically wrong and misleading. (What's worse is that I initially encountered this while calling np.mean() on a Series; I did not expect it to invoke pandas' mean(), but that seems to be the case.) It is impossible to have a mean below 5 for points lying between 5 and 6.

#Output:
1.3421772718429565 5.4999743
difference is -4.1577969789505005
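
If the running sum is kept in a single float32 accumulator (my guess at the mechanism; see the update at the end), the broken value is exactly what you would expect: once the accumulator reaches 2**27, the float32 spacing (ULP) at that magnitude is 16, so adding any value below 8 rounds away to nothing and the sum stalls there. A minimal sketch:

import numpy as np

acc = np.float32(2**27)       # 134217728.0
print(acc + np.float32(5.5))  # still 134217728.0: the ULP here is 16, so 5.5 rounds away
print(np.float32(2**27) / np.float32(100_000_000))  # ~1.3421773, the broken "mean" above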

So, for future safety: use float64 if possible, circumventing the issue, and always call .values when calculating the mean. Noted.

What I would like to understand, however, is why this behaviour happened in the first place.
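
One reproduction that doesn't involve pandas at all: assuming the fast path does a naive left-to-right float32 accumulation (an assumption on my part, not verified against any library's source), np.cumsum exhibits the same failure, since np.add.accumulate runs strictly sequentially, whereas numpy's sum() uses pairwise summation:

import numpy as np

n = 100_000_000
a = np.random.uniform(5.0, 6.0, n).astype(np.float32)

naive = np.cumsum(a, dtype=np.float32)[-1]  # sequential float32 running sum (allocates ~400MB)
pairwise = a.sum()                          # numpy's pairwise summation, still float32
exact = a.sum(dtype=np.float64)             # float64 accumulator, for reference

print(naive / n)     # ~1.34: saturated near 2**27, like the pandas result
print(pairwise / n)  # ~5.49997
print(exact / n)     # ~5.5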

You can ignore the code below:

#For reference: something curious
df[0].sum() / len(df[0])  # for some reason, this way works just fine too
5.4999744

## The code below takes a while to run; you shouldn't run it. It is just for comparison.
# import statistics
# statistics.mean(df[0])
## Output:
# 5.499971017958364

Edit: I ran this on pandas 0.23.4, numpy 1.15.4, Python 3.7.1. Also, I am on Windows, if that affects anything.

UPDATE: Okay, the issue is somehow related to a library called bottleneck. I created a test environment without bottleneck, and the pandas mean ends up with essentially the same result as numpy. I have updated the question to reflect the change. I would still like to understand what that library did to cause such a result, and how it plays into pandas. My bottleneck version is 1.2.1.
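
For anyone who wants to check this on their own setup without uninstalling anything, pandas exposes an option to turn the bottleneck acceleration off (a documented pandas option; I have only verified it against my versions):

import pandas as pd

pd.set_option('compute.use_bottleneck', False)
# with this off, df[0].mean() should match df[0].values.mean()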
