1

I have a dataset from here: https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv

The code to load it is:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv'
fullht_df = pd.read_csv(url)
fullht_df.head(n=100)
# Get the first 500 rows from the dataset and use that for the rest of the assignment.
ht_df = fullht_df.head(n=500)

enter image description here

I am trying to iterate through the gender (sex) column and replace the unknown value (?) with some sensible value. The value will be a calculated value of either "M" or "F" - depending upon some other algorithm that is not important to the question.

I am new to Pandas and for some reason this is proving more difficult than I ever could imagine.

What is the best way to iterate over the column series and test

Because there are many unknown values I have first replaced ? with np.NaN

# Replace with NaN so many of the Pandas functions will work.
ht_df = ht_df.replace('?', np.NaN) 

This let me update all the numeric missing values very nicely with the mean value (not important to this question except to explain why I replaced everything with NaN):

# Replace the NaN's of the numeric columns with the mean
ht_df["TSH"] = ht_df["TSH"].fillna(mean["TSH"])
ht_df["T3"] = ht_df["TSH"].fillna(mean["T3"])
ht_df["TT4"] = ht_df["TSH"].fillna(mean["TT4"])
ht_df["FTI"] = ht_df["TSH"].fillna(mean["FTI"])

But now I am left with iterating down the "sex" column to replace and I cannot iterate over it nicely.

I used the following code to help me understand what is going on. I have only included a sample of the output.

for item in ht_df["sex"]:
   print(f"{item} {type(item)}")

Output:

F <class 'str'>
F <class 'str'>
... <snip> ...
F <class 'str'>
F <class 'str'>
M <class 'str'>
F <class 'str'>
nan <class 'float'>
F <class 'str'>

The nan is a float, which makes sense. But I am unable to test for it like this:

for item in ht_df["sex"]:
   if item == np.NaN:
      print(f"{item} is NaN\n")
   print(f"{item} {type(item)}")

The if condition is never triggered.

How can I test the value for NaN as I iterate over it and then update that cell with a new value?

A full test code is here:

import pandas as pd
import numpy as np
import ssl

from pandas.core.arrays import string_
from pandas.core.frame import DataFrame


def main():
    ssl._create_default_https_context = ssl._create_unverified_context

    url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv'
    fullht_df = pd.read_csv(url)

    print(fullht_df.head(n=100))

    # Get the first 500 rows from the dataset and use that for the rest of the assignment.
    ht_df = fullht_df.head(n=500)

    # Display the dataset's dimension
    print(f"Working dataset dimension is: {ht_df.shape}\n")

    # Get the first 500 rows from the dataset and use that for the rest of the assignment.
    ht_df = fullht_df.head(n=500)

    # Cells with missing data have a '?' in them. 
    # First replace ? with np.NaN so we can utilise some other nice Pandas dataframe methods. We can use a global replace because, upon dataset ins[ection, the unknown ('?') only exists in the numeric columns.
    # Convert the value columns from text to numeric.
    # Calculate the median value for the numeric-data coluimns
    # Replace the NaN values with a reasonable value. For this exercise we have chosen the mean for the column
    # Recalculate the median value for the numeric-data coluimns

    # Prepare the data so it is calculable
    ht_df = ht_df.replace('?', np.NaN)                                                        # Replace with NaN so many of the Pandas functions will work.
    ht_df[["TSH","T3","TT4","FTI"]] = ht_df[["TSH","T3","TT4","FTI"]].apply(pd.to_numeric)    # CSV loads as text. Convert the cells to numeric

    # Calculate the Mean and Median prior to replacing missing values
    mean = ht_df[["TSH","T3","TT4","FTI"]].mean(skipna=True)
    median = ht_df[["TSH","T3","TT4","FTI"]].median(skipna=True)

    # Replace the NaN's of the numeric columns with the mean
    ht_df["TSH"] = ht_df["TSH"].fillna(mean["TSH"])
    ht_df["T3"] = ht_df["TSH"].fillna(mean["T3"])
    ht_df["TT4"] = ht_df["TSH"].fillna(mean["TT4"])
    ht_df["FTI"] = ht_df["TSH"].fillna(mean["FTI"])

    # Replace the M/F missing values with the most frequently occuring gender provided "pregnant" is false. Otherwise set the value to F.
    print("@@@@@@@@@@@@@@")
    for item in ht_df["sex"]:
        if item == np.NaN:
            print(f"{item} is NaN\n")
        print(f"{item} {type(item)}")
    print("@@@@@@@@@@@@@@")

if __name__ == "__main__":
    main()
Bryon
  • 939
  • 13
  • 25
  • Are you looking for [`.isna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html)? – Mr. T Mar 20 '22 at 10:29
  • Sort of - but I think this is a clue. I updated the condition to if item.isna(): The error returned is: AttributeError: 'str' object has no attribute 'isna' – Bryon Mar 20 '22 at 10:32
  • Actually - the NaN element is of type float. So there is no .isna() method on it. But I am getting closer – Bryon Mar 20 '22 at 10:37
  • As shown in the linked document, you use it with a series (column) or the entire dataframe to create a Boolean mask. The point of pandas is to avoid explicit looping and taking advantage of vectorized pandas operations. So, you would filter with Boolean mask and alter the values. – Mr. T Mar 20 '22 at 10:37
  • I still have to iterate over the result. But thanks to your forst comment I tracked down a way with: if type(item) == float and math.isnan(item) – Bryon Mar 20 '22 at 10:41
  • Why do you want to use a loop? use `.fillna(...)` – Corralien Mar 20 '22 at 10:42
  • I want to use a loop because I cannot change the nan with a single value. It is based on an algorithm (not shown in the sample). So sometimes it will be replaced with "M" and sometimes "F" – Bryon Mar 20 '22 at 10:43
  • You don't need to iterate to do that. See my answer. – Tom Mar 20 '22 at 10:44
  • I fixed your loop but I maintain you don't need no loop. Can you check my answer please? – Corralien Mar 20 '22 at 10:48
  • Does this answer your question? [Checking if particular value (in cell) is NaN in pandas DataFrame not working using ix or iloc](https://stackoverflow.com/questions/47440077/checking-if-particular-value-in-cell-is-nan-in-pandas-dataframe-not-working-us) – Mr. T Mar 20 '22 at 10:49

3 Answers3

0

I'm not sure why you want to use iteration to print out each item. But if all you want is a print out of the rows where the 'sex' column is np.nan then:

print(ht_df["sex"].isna()) 

will show "True" or "False" for every np.nan

If you want to just see the dataframe elements with those rows, you can try something like:

print(ht_df.loc[ht_df["sex"].isna(), ["sex", "pregnant", "TSH"]])

which would print all the rows that are np.nan in the sex column and then the values of those three columns (which I picked arbitrarily, you can fill with any list you like).

Lastly, if you want to create a formula to guess at whether a np.nan row should be M or F, I'd usually create a "sex_predict" column, fill that using whatever algo you are using, and then use fillna

ht_df["sex"] = ht_df["sex"].fillna(ht_df["sex_predict"])
Tom
  • 1,003
  • 2
  • 13
  • 25
  • Sorry "isna" ... that was a typo – Tom Mar 20 '22 at 10:45
  • I want to iterate because I am so used to procedural programming and I do not know what I am doing..... :-) Wow - working with data like this is a very different way of programming than I have ever done. No wonder I am scratching my head as I am learning it. But I get it. – Bryon Mar 20 '22 at 10:48
  • Just trying this out and wondering how it helps me. I still need to iterate through the ht_df["sex"] to work out if I need to use my formula right? Then I just put the output in the new column and update with fillna() as you suggest. Is that right? – Bryon Mar 20 '22 at 11:02
  • Very rare you have to iterate through the rows to get to what you want. If "calc_gender" were just df[x] + df[y] + df[z] (for sake of argument) you can just make df['sex-predict'] equal to that formula for ALL rows, and then fillna the "sex" column. It won't overwrite anything and be far far faster computationally. – Tom Mar 20 '22 at 14:01
  • Thanks. In this case it is based on an analysis of lots of other data so I do need a function and it seems like iterating is right. But I am going to play around with what you have written for the future. – Bryon Mar 21 '22 at 01:00
0

You can't check item == np.NaN, you have to use pd.isna(item):

for item in ht_df["sex"]:
    if pd.isna(item):
        print(f"{item} is NaN\n")
    print(f"{item} {type(item)}")

Output:

...
M <class 'str'>
F <class 'str'>
nan is NaN

nan <class 'float'>
F <class 'str'>
...
Corralien
  • 109,409
  • 8
  • 28
  • 52
0

Thanks everyone (esp @Tom and @Corralien as they were both correct) for the answers. They were all very illuminating as to how I should be working with data and pandas. I combined the two into my solution below

tmp_col = "sex-predict"
ht_df[tmp_col] = ht_df["sex"]
for (index, row_series) in ht_df.iterrows():
    if pd.isna(row_series["sex"]):
        ht_df.at[index, tmp_col] = calc_gender().   # Calculate the value for the missing value.

# Copy over any NaN values in the sex column using the value from the temporary column
ht_df["sex"] = ht_df["sex"].fillna(ht_df[tmp_col])
ht_df = ht_df.drop([tmp_col], axis=1)       # Drop the temporary column
Bryon
  • 939
  • 13
  • 25