I have a dataset from here: https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv
The code to load it is:
import pandas as pd
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv'
fullht_df = pd.read_csv(url)
fullht_df.head(n=100)
# Get the first 500 rows from the dataset and use that for the rest of the assignment.
ht_df = fullht_df.head(n=500)
I am trying to iterate through the gender ("sex") column and replace the unknown value ("?") with a sensible value. The replacement will be either "M" or "F", calculated by another algorithm that is not important to the question.
I am new to Pandas and for some reason this is proving more difficult than I ever could imagine.
What is the best way to iterate over the column (a Series) and test each value?
Because there are many unknown values, I first replaced "?" with np.NaN:
# Replace with NaN so many of the Pandas functions will work.
ht_df = ht_df.replace('?', np.NaN)
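As an aside, the same replacement can probably be done at load time. The following is a minimal sketch, assuming the file marks missing data only with "?"; it uses a small in-memory CSV with made-up values instead of the real URL, and `sample_df` is a name chosen for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV: the column names mirror the question,
# the values are made up for illustration.
csv_text = "sex,TSH,T3\nF,1.3,2.5\n?,?,2.0\nM,4.1,?\n"

# na_values="?" tells read_csv to treat "?" as missing while parsing,
# so the numeric columns are read as floats rather than text.
sample_df = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(sample_df.dtypes)
print(sample_df.isna().sum())  # one missing value in each column
```

With this approach the later `pd.to_numeric` conversion may become unnecessary, since the "?" strings never reach the numeric columns.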
This let me update all the numeric missing values very nicely with the mean value (not important to this question except to explain why I replaced everything with NaN):
# Replace the NaNs of the numeric columns with the mean of the same column
ht_df["TSH"] = ht_df["TSH"].fillna(mean["TSH"])
ht_df["T3"] = ht_df["T3"].fillna(mean["T3"])
ht_df["TT4"] = ht_df["TT4"].fillna(mean["TT4"])
ht_df["FTI"] = ht_df["FTI"].fillna(mean["FTI"])
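A more compact way to do the mean fill, sketched here on a small made-up frame rather than the real dataset: passing a Series of column means to `DataFrame.fillna` fills each column from its own mean, which avoids repeating the column name on every line. The names `demo_df`, `numeric_cols` and `col_means` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the numeric columns of the question.
demo_df = pd.DataFrame({
    "TSH": [1.0, np.nan, 3.0],
    "T3": [2.0, 4.0, np.nan],
})
numeric_cols = ["TSH", "T3"]

# mean() skips NaN by default and returns a Series indexed by column name.
col_means = demo_df[numeric_cols].mean()

# fillna with a Series matches on column name, so each column
# is filled with its own mean.
demo_df[numeric_cols] = demo_df[numeric_cols].fillna(col_means)
print(demo_df)
```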
But now I am left with iterating down the "sex" column to replace its missing values, and I cannot find a clean way to do it.
I used the following code to help me understand what is going on. I have only included a sample of the output.
for item in ht_df["sex"]:
    print(f"{item} {type(item)}")
Output:
F <class 'str'>
F <class 'str'>
... <snip> ...
F <class 'str'>
F <class 'str'>
M <class 'str'>
F <class 'str'>
nan <class 'float'>
F <class 'str'>
The nan is a float, which makes sense. But I am unable to test for it like this:
for item in ht_df["sex"]:
    if item == np.NaN:
        print(f"{item} is NaN\n")
    print(f"{item} {type(item)}")
The if condition is never triggered.
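For context, this matches how floating-point NaN is defined: NaN never compares equal to anything, including itself, so an `==` test cannot succeed. A minimal sketch of the behaviour and of `pd.isna` as the test, using a made-up Series in place of the real column (it spells the constant `np.nan`, since the `np.NaN` alias was removed in NumPy 2.0):

```python
import numpy as np
import pandas as pd

value = np.nan
print(value == np.nan)  # False: NaN is never equal to anything
print(value != value)   # True: the classic self-inequality check
print(pd.isna(value))   # True: the intended way to test in pandas

# Made-up stand-in for ht_df["sex"].
sex = pd.Series(["F", "M", np.nan, "F"])
for item in sex:
    if pd.isna(item):
        print(f"{item} is NaN")
```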
How can I test the value for NaN as I iterate over it and then update that cell with a new value?
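One possible approach, sketched under assumptions: rather than looping, build a boolean mask with `isna()` and assign through `.loc`. The rule below follows the comment in the test code (most frequent gender unless "pregnant" is true, otherwise "F"). The frame is made up, and encoding "pregnant" as "t"/"f" is an assumption about the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data; "pregnant" encoded as "t"/"f" is an assumption.
demo_df = pd.DataFrame({
    "sex": ["F", np.nan, "M", np.nan, "M"],
    "pregnant": ["f", "t", "f", "f", "f"],
})

most_common = demo_df["sex"].mode()[0]   # mode() ignores NaN by default
missing = demo_df["sex"].isna()          # True where the cell is NaN
is_pregnant = demo_df["pregnant"] == "t"

# Assign through .loc so the original frame is updated in place.
demo_df.loc[missing & is_pregnant, "sex"] = "F"
demo_df.loc[missing & ~is_pregnant, "sex"] = most_common

print(demo_df["sex"].tolist())  # ['F', 'F', 'M', 'M', 'M']
```

If the replacement rule cannot be expressed with masks, looping over `demo_df.index[missing]` and writing each cell with `demo_df.at[idx, "sex"]` should also work, though it is usually slower.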
The full test code is here:
import pandas as pd
import numpy as np
import ssl
from pandas.core.arrays import string_
from pandas.core.frame import DataFrame
def main():
    ssl._create_default_https_context = ssl._create_unverified_context
    url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv'
    fullht_df = pd.read_csv(url)
    print(fullht_df.head(n=100))

    # Get the first 500 rows from the dataset and use that for the rest of the assignment.
    ht_df = fullht_df.head(n=500)

    # Display the dataset's dimension
    print(f"Working dataset dimension is: {ht_df.shape}\n")

    # Cells with missing data have a '?' in them.
    # First replace ? with np.NaN so we can utilise some other nice Pandas dataframe methods. We can use a global replace because, upon dataset inspection, the unknown ('?') only exists in the numeric columns.
    # Convert the value columns from text to numeric.
    # Calculate the median value for the numeric-data columns
    # Replace the NaN values with a reasonable value. For this exercise we have chosen the mean for the column
    # Recalculate the median value for the numeric-data columns

    # Prepare the data so it is calculable
    ht_df = ht_df.replace('?', np.NaN)  # Replace with NaN so many of the Pandas functions will work.
    ht_df[["TSH","T3","TT4","FTI"]] = ht_df[["TSH","T3","TT4","FTI"]].apply(pd.to_numeric)  # CSV loads as text. Convert the cells to numeric

    # Calculate the mean and median prior to replacing missing values
    mean = ht_df[["TSH","T3","TT4","FTI"]].mean(skipna=True)
    median = ht_df[["TSH","T3","TT4","FTI"]].median(skipna=True)

    # Replace the NaNs of the numeric columns with the mean of the same column
    ht_df["TSH"] = ht_df["TSH"].fillna(mean["TSH"])
    ht_df["T3"] = ht_df["T3"].fillna(mean["T3"])
    ht_df["TT4"] = ht_df["TT4"].fillna(mean["TT4"])
    ht_df["FTI"] = ht_df["FTI"].fillna(mean["FTI"])

    # Replace the M/F missing values with the most frequently occurring gender provided "pregnant" is false. Otherwise set the value to F.
    print("@@@@@@@@@@@@@@")
    for item in ht_df["sex"]:
        if item == np.NaN:
            print(f"{item} is NaN\n")
        print(f"{item} {type(item)}")
    print("@@@@@@@@@@@@@@")


if __name__ == "__main__":
    main()