
Using the pandas or NumPy library, is there a way to code/specify different categories of missing (i.e. NaN) values?

This is easily done in the Stata statistical software environment, where missing values range from '.a' to '.z'. This approach is often used when you are interested in specifying different types of 'missingness' in your models (participant lost due to attrition, participant refused to take the test, etc.).

However, I'm trying to transition from proprietary Stata to the more open Python, and am wondering if you can do something similar. For those interested, I think there is also a way to specify missingness codes in R.

As a very simple example, say you have the pandas dataframe below: a sample of 4 subjects aged 18-40, from whom you collected weight and heart rate. Let's say that by the design of your study, you decided to only collect heart rate from participants who are 35 or younger. Thus, subjects 3 and 4 (aged 38 and 40) have missing (np.nan) values for heart rate by design, whereas subject 1 has a missing value for some unknown reason. Let's say you plan to collect data on 1000 more subjects, and would like to distinguish between values missing at random and values missing by design.

I really like the way that pandas handles NaN values, for example by intuitively excluding them in groupby operations. Not to mention, NaN appears to be recognized as the de facto 'missingness' value in the Python community. Is there a way to keep missing values as NaN, while specifying different types or categories of NaN values?

    import numpy as np
    import pandas as pd

    my_dict = {
        'subject':['sub-01','sub-02','sub-03','sub-04'],
        'age_in_years':[18,20,38,40],
        'weight_in_lbs':[120,160,200,240],
        'resting_heart_rate':[np.nan,65, np.nan,np.nan]
    }
    df = pd.DataFrame(my_dict)
    df.set_index('subject',inplace=True)
    print(df)

Side note: Some analysts (particularly those who are used to SPSS) might replace NaN values with integer codes such as -999 or -888 to indicate missingness, but then you run the risk of a naive user of the dataset accidentally including these values in some model or arithmetic, because Python will treat them as ordinary integers rather than as NaN. Others might recode the NaN values with strings like '.a', '.b', or 'missing_refused', 'missing_noshow', but this quickly gets annoying when, for example, trying to derive column means (if the non-missing values are integers/floats).
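To make the side note concrete, here is a minimal sketch of both pitfalls, using the heart-rate column from my example. The specific codes (-999 and -888, '.a' and '.b') and what they stand for are placeholders made up for illustration, not any standard.

```python
import numpy as np
import pandas as pd

# Heart-rate column from the example above, with NaN for every missing value.
hr_nan = pd.Series([np.nan, 65, np.nan, np.nan], name='resting_heart_rate')
print(hr_nan.mean())  # 65.0 (NaN is skipped, only the observed value is averaged)

# Same column with hypothetical integer codes:
# -999 = missing for an unknown reason, -888 = missing by design.
hr_int = pd.Series([-999, 65, -888, -888], name='resting_heart_rate')
print(hr_int.mean())  # -677.5 (the codes are averaged as if they were measurements)

# Same column with hypothetical string codes.
hr_str = pd.Series(['.a', 65, '.b', '.b'], name='resting_heart_rate')
print(hr_str.dtype)   # object (the column is no longer numeric)

# Coercing back to numbers restores NaN, but the distinction
# between '.a' and '.b' is lost in the process.
print(pd.to_numeric(hr_str, errors='coerce').mean())  # 65.0
```

So the integer codes silently corrupt the arithmetic, while the string codes only become usable again once the categories have been thrown away, which is exactly the trade-off I'd like to avoid.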

Has anyone else out there encountered this problem before? I'm interested in hearing others' approaches to similar scenarios!

seh33