Looks like your data has a variation between Not Provided
and Not provided
with a lower p
, so is better to use a df.replace
with regex=True
including this alternation ([pP]rovided
). The choice of replace value depends on how you gonna handle your data, but here I used np.NaN
from numpy library. After the replace you can use fillna(0)
on the column with BasePay
, however the column type is object
so it needs to be interpreted as float (.astype(float)
) before calculating the mean.
Input used as data.csv
ID Name Occupation Pay OvertimePay BasePay Val4 Val5
0 148646 Carolyn A Wilson Human Services Technician 0 0 0 0 0.00
1 148647 Not provided Not provided Not Provided Not Provided Not Provided Not Provided 0.00
2 148648 Joann Anderson Communications Dispatcher 2 0 0 0 0 0.00
3 148649 Leon Walker Custodian 0 0 0 0 0.00
4 148650 Roy 1 Tillery Custodian 0 0 0 0 0.00
5 148651 Not provided Not provided Not Provided Not Provided Not Provided Not Provided 0.00
6 148652 Not provided Not provided Not Provided Not Provided Not Provided Not Provided 0.00
7 148653 Not provided Not provided Not Provided Not Provided Not Provided Not Provided 0.00
8 148654 Joe Lopez Counselor, Log Cabin Ranch 0 0 -618.13 0 -618.13
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv", sep=";")
print(df)
df = df.replace("Not [pP]rovided", np.NaN, regex=True)
BasePay_mean = df["BasePay"].fillna(0).astype(float).mean()
print(BasePay_mean)
Output from BasePay_mean
-68.68111111111111