How to clean some columns of my dataset from a particular string?

Question

I am trying to clean my dataset. It has few columns which have all String/float/int (for name/pay/overtimepay) values but few rows contain 'Not provided' as a value.

newData = data[~data['BasePay'].isin(['Not Provided'])]
newData = data[~data['BasePay'].str.contains('Not Provided', na=False)]
print(newData['BasePay'].mean())

and

df = data.select_dtypes(object)
mask = ~df.apply(lambda series: series.str.contains('Not Provided')).any(axis=1)
no_eco = data[mask]

to clean the column of 'BasePay' and calculate mean but failed. I'd appreciate if you could help. Image of the concerned rows in my dataset

StackOverFlow Pandas 101 advice: don't paste images! This will increase your chances of getting a good answer... and likely within 10 minutes of asking... Copy and paste the actual data into your question... https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — David Erickson, Jul 07 '20 at 21:07
Better yet, paste the code to generate a dummy / sample dataset with few rows, so that people can copy paste it and quickly workout and answer for you. — Akshay Sehgal, Jul 10 '20 at 00:03

score 1 · Accepted Answer · answered Jul 10 '20 at 00:00

Looks like your data has a variation between Not Provided and Not provided with a lower p, so is better to use a df.replace with regex=True including this alternation ([pP]rovided). The choice of replace value depends on how you gonna handle your data, but here I used np.NaN from numpy library. After the replace you can use fillna(0) on the column with BasePay, however the column type is object so it needs to be interpreted as float (.astype(float)) before calculating the mean.

Input used as data.csv

       ID               Name                    Occupation            Pay    OvertimePay        BasePay           Val4    Val5
0  148646   Carolyn A Wilson     Human Services Technician              0              0              0              0    0.00
1  148647       Not provided                  Not provided   Not Provided   Not Provided   Not Provided   Not Provided    0.00
2  148648     Joann Anderson   Communications Dispatcher 2              0              0              0              0    0.00
3  148649        Leon Walker                     Custodian              0              0              0              0    0.00
4  148650      Roy 1 Tillery                     Custodian              0              0              0              0    0.00
5  148651       Not provided                  Not provided   Not Provided   Not Provided   Not Provided   Not Provided    0.00
6  148652       Not provided                  Not provided   Not Provided   Not Provided   Not Provided   Not Provided    0.00
7  148653       Not provided                  Not provided   Not Provided   Not Provided   Not Provided   Not Provided    0.00
8  148654          Joe Lopez    Counselor, Log Cabin Ranch              0              0        -618.13              0 -618.13

import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", sep=";")
print(df)

df = df.replace("Not [pP]rovided", np.NaN, regex=True)

BasePay_mean = df["BasePay"].fillna(0).astype(float).mean()
print(BasePay_mean)

Output from BasePay_mean

-68.68111111111111

How to clean some columns of my dataset from a particular string?

1 Answers1