As you can see in the code below, I calculate the variance of the data in the 'open' column in two different ways. The only difference is that in the second version I grab the values rather than the column containing them. Why would this lead to different variance calculations?

import pandas as pd

apple_prices = pd.read_csv('apple_prices.csv')

print(apple_prices['open'].values.var())
#prints 102.22564310059172

print(apple_prices['open'].var())
#prints 103.82291877403847

1 Answer

The reason for the difference is that pandas.Series.var has a default ddof (delta degrees of freedom) of 1, while numpy.ndarray.var has a default ddof of 0. Setting ddof explicitly to the same value produces the same result:

import pandas as pd
import numpy as np
np.random.seed(0)

x = pd.Series(np.random.rand(100))

print(x.var(ddof=1))
# 0.08395738934787107


print(x.values.var(ddof=1))
# 0.08395738934787107
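In case it helps, here is a rough sketch of what ddof actually changes (reusing the x, np, and pd from the snippet above): the sum of squared deviations from the mean is divided by n - ddof, so ddof=1 gives the sample variance and ddof=0 the population variance.

n = len(x)
squared_deviations = ((x - x.mean()) ** 2).sum()

# ddof=1 (pandas' default) divides by n - 1: the sample variance
print(np.isclose(squared_deviations / (n - 1), x.var()))
# True

# ddof=0 (numpy's default) divides by n: the population variance
print(np.isclose(squared_deviations / n, x.values.var()))
# True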

See the documentation at:
pandas.Series.var
numpy.var
