I am working on an assignment meant to familiarize me with pandas, and the part I am stuck on asks for the sample variance of Y. I must write the Python/pandas statement for this step, and the assignment gives a hint: the dframe.count() method may be useful here. I know that the sample variance is the sum of squared differences from the mean, divided by one less than the number of elements in the sample.

import pandas as pd
datafile='/Users/austinite/Desktop/Assignment1Data.csv'
frame = pd.read_csv(datafile)

yMean = frame['Y'].mean()
frame['Diff'] = frame['Y'] - yMean
frame['DiffSqr'] = frame['Diff'].pow(2)

sumSqrDiff = frame['DiffSqr'].sum()
sampleVariance = sumSqrDiff / (frame.count(axis='columns') - 1)

This is the code I have so far. I first tried count(axis='Y'), thinking it would count the values in that column, but that failed with an error saying Y is not defined. I then tried axis='columns'. That runs, but instead of a single number it returns the same value repeated 300 times.
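For anyone hitting the same thing, here is a minimal sketch of why axis='columns' repeats one value. The small frame below is a hypothetical stand-in for the assignment data, not the real CSV. With axis='columns', count() returns one number per row (the non-null cells in that row), so a 300-row frame would yield 300 entries.

```python
import pandas as pd

# Hypothetical stand-in for the assignment data: 4 rows, 2 columns.
frame = pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0], "Y": [2.0, 4.0, 6.0, 8.0]})

# axis="columns" counts across the columns, giving one result per row.
per_row = frame.count(axis="columns")

# The default (axis=0) counts down the rows, giving one result per column.
per_col = frame.count()

print(per_row.tolist())   # [2, 2, 2, 2]
print(per_col.to_dict())  # {'X': 4, 'Y': 4}
```

Dividing a scalar by per_row therefore produces a Series with one entry per row rather than a single variance.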

Edit to add solution:

n = frame['Y'].count()
sampleVariance = sumSqrDiff / (n - 1)
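As a sanity check, the same steps can be compared against pandas' built-in Series.var(), which uses ddof=1 (the n - 1 divisor) by default. This is only a sketch: the values in Y are made up for illustration and are not from Assignment1Data.csv.

```python
import pandas as pd

# Made-up sample values, standing in for the Y column of the real file.
frame = pd.DataFrame({"Y": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

# Manual computation, following the steps in the question.
y_mean = frame["Y"].mean()
sum_sqr_diff = (frame["Y"] - y_mean).pow(2).sum()
n = frame["Y"].count()            # scalar: non-null values in Y
sample_variance = sum_sqr_diff / (n - 1)

# Built-in sample variance; ddof=1 is the default for Series.var().
builtin_variance = frame["Y"].var()

print(sample_variance, builtin_variance)
```

The two printed numbers should agree up to floating-point rounding.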
Austinite
  • If you want the number of rows, just use `len(frame)`? (COUNT will operate on each column separately, skip nulls, etc, so just use len?) Pandas can calculate variance natively though; https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.var.html – MatBailie Mar 12 '23 at 22:37
  • If you MUST use count, count a single column, not a dataframe; `frame['Y'].count()` – MatBailie Mar 12 '23 at 22:44
  • I did see that, I think it is just meant to help us understand the different methods. I am going to edit my post, because I figured out how to fix my problem. – Austinite Mar 12 '23 at 22:46
  • Thank you! Yes, I realized I was using it incorrectly and that fixed my problem :) – Austinite Mar 12 '23 at 22:49
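To illustrate the len versus count point raised in the comments: the two only differ when the column contains missing values. The sketch below assumes a hypothetical Y column with one NaN. Since mean() and sum() skip NaN by default, dividing by `frame['Y'].count() - 1` stays consistent with `var()`, while `len(frame) - 1` would not.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
frame = pd.DataFrame({"Y": [1.0, 2.0, np.nan, 4.0]})

n_rows = len(frame)            # counts every row, including the NaN row
n_valid = frame["Y"].count()   # counts only non-null values

# mean() and sum() skip NaN by default, so only valid values contribute.
sum_sqr_diff = (frame["Y"] - frame["Y"].mean()).pow(2).sum()

by_count = sum_sqr_diff / (n_valid - 1)  # matches frame["Y"].var()
by_len = sum_sqr_diff / (n_rows - 1)     # divisor too large when NaN is present

print(n_rows, n_valid)  # 4 3
print(by_count, by_len, frame["Y"].var())
```

If the column has no missing values, both divisors are the same and either approach gives the same result.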