-1

I'm working with .csv files to find rows with duplicate records along with missing DOBs and genders.

The .csv files I'll be working with do not have column headers.

  1. Load .csv file (No issues.)

  2. Add column headers (No issues.)

  3. Determine if there are duplicate rows (No issues with this code.)

  4. Determine if there are missing Dates of Birth using .isna. Print the SSN, Last Name, First Name and DOB. The DOB should be blank due to the inclusion of .isna('') (Getting a syntax error on this line)

If the boolean returns as "TRUE" (field is blank / null) to then print the partial row data listed in the dataframe to an Excel spreadsheet. I included the .fillna('') code to make the DOB field blank instead fo TRUE when they print to the spreadsheet. Each dataframe will have its own tab within the Excel workbook. The last block of code writing the separate tabs to the worksheet works fine.

  1. Determine if there are missing genders and print out EmpID, SSN, LastName, FirstName and Gender. (I assume the syntax error from 4) above would happen on this row as well.)

QUESTION: What is the best way to check the Date of Birth and Gender fields for nulls / blanks and then print out the specific fields? My code is included below.

#Import Pandas

import pandas as pd

# Read csv file using Pandas
df = pd.read_csv('data.csv', engine='python')

# Add column headers to csv file
df.columns = ["EmpID","SSN", "Status", "LastName", "FirstName", "Middle", "DOB", "Address1", "Address2", "City", "State", "Zip", "Phone", "Email", "Gender"]

# Dataframe 0 used to obtain list of duplicate rows referencing multiple columns
df0 = df[df[["EmpID", "SSN","DOB"]].duplicated() == True

# Dataframe 1 used to obtain list of missing Dates of Birth
df1 = df["DOB"].isna().fillna('')
if df1 == True: print(df.columns = "EmpID", "SSN", "LastName", "FirstName", "DOB")

## Dataframe 2 used to obtain list of missing Genders
df2=df["Gender"].isna().fillna('')
if df2 == True: print(df.columns = "EmpID", "SSN", "LastName", "FirstName", "Gender")

## Print the above 3 dataframes as separate tabs in one Excel file
with pd.ExcelWriter('output.xlsx') as writer:
    df0.to_excel(writer, sheet_name='Duplicates', index=True)
    df1.to_excel(writer, sheet_name='DOB', index=True)
    df2.to_excel(writer, sheet_name='Gender', index=True)
Corralien
  • 109,409
  • 8
  • 28
  • 52
db18145
  • 11
  • 3
  • See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – BeRT2me May 10 '23 at 05:25

1 Answers1

0

You can slice the DataFrame to just return where the "True" rows are. I think the issue with your code is that you are applying a fillna() to a boolean Series so it isn't going to fill anything.

In the code below, df0 stores the whole row that is duplicated, df1 returns the subset of just the Name and the Age (which will be NA).

import pandas as pd

data = [['Tom', 20, 'blue'], ['Dick', 40, 'red'], ['Dick', 40, 'yellow'], ['Harry', pd.NA, 'green']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Color'])

df0 = df[df.duplicated(subset=['Name', 'Age'])]
df1 = df.loc[df['Age'].isna(), ['Name', 'Age']]
Daniel Logan
  • 43
  • 1
  • 6