As you can see in the code below, I calculate the variance of the data in the 'open' column in two different ways. The only difference is that in the second version I grab the values rather than the column containing them. Why would this lead to different variance calculations?

import pandas as pd

apple_prices = pd.read_csv('apple_prices.csv')

print(apple_prices['open'].values.var())
#prints 102.22564310059172

print(apple_prices['open'].var())
#prints 103.82291877403847

1 Answer

The reason for the difference is that pandas.Series.var has a default ddof (delta degrees of freedom) of 1, while numpy.ndarray.var has a default ddof of 0. Setting ddof explicitly to the same value produces the same result:

import pandas as pd
import numpy as np
np.random.seed(0)

x = pd.Series(np.random.rand(100))

print(x.var(ddof=1))
# 0.08395738934787107


print(x.values.var(ddof=1))
# 0.08395738934787107
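In case it helps, here is a rough sketch of what ddof actually changes (reusing the x, np, and pd from the snippet above): the sum of squared deviations from the mean is divided by n - ddof, so ddof=1 gives the sample variance and ddof=0 the population variance.

n = len(x)
squared_deviations = ((x - x.mean()) ** 2).sum()

# ddof=1 (pandas' default) divides by n - 1: the sample variance
print(np.isclose(squared_deviations / (n - 1), x.var()))
# True

# ddof=0 (numpy's default) divides by n: the population variance
print(np.isclose(squared_deviations / n, x.values.var()))
# True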

See the documentation at:
pandas.Series.var
numpy.var
