
I'm using Python 3.7 and the latest version of pandas. I have a DataFrame with the following dtypes: [category, float, object (text)]. All I want to do is fill the NaN values for the entire DataFrame at once.

What I've been doing so far is going one by one through every single column (hundreds at a time) and grouping the column names into lists organized by dtype, then casting each list of columns with astype(datatype). This is extremely tedious and inefficient, and I still get back lots of errors. I've been doing it this way for months, but now I have Excel sheets with arbitrary data to read in, and given the size of the DataFrames I'm beginning to work with (roughly 400k), it's unrealistic to continue that way.

For the dtypes "category" and "object (text)", I want to fill NaN with the string 'empty'. For float dtypes, I want to fill NaN with 0.0. At this point in my project, I am not yet interested in filling with mean/median values.

Ideally I would like to achieve this with something simple like:

df.fillna_all({'float':0, 'category':'empty', 'object':'empty'})

Please help!

  • It would help to see a sample input and expected output as well as code to make a [mcve], since your description isn't entirely clear. That said, [select_dtypes()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html) seems applicable – G. Anderson Dec 17 '20 at 16:55
  • I'll edit the question to include your suggestions, thanks. As far as I understand, select_dtypes will only select columns that have previously been defined as a specific type. When I ran df.select_dtypes(include=['float64']).columns.tolist(), no data was returned, even though there are clearly columns that hold float values. – data_the_goonie Dec 17 '20 at 17:03
  • In that case, please provide more information about how you're currently setting the dtype, since otherwise it's really hard to know how to help. See also [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Dec 17 '20 at 17:44

1 Answer


I think this is exactly what you need:

1) To fill in the text columns (object dtype) with 'empty', you can do:

# Identify the columns in your df that have dtype object (i.e. text)
cat_vars = [col for col in df.columns if df[col].dtype == 'O']

# Loop over them, and fill them with 'empty'.
# Assign the result back: fillna(..., inplace=True) on df[col] can act on a
# temporary copy and leave df itself unchanged.
for col in cat_vars:
    df[col] = df[col].fillna('empty')
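Note that the check above only matches object columns. Columns with the pandas category dtype are a separate case: filling them with a value that is not already one of their categories raises an error, so the fill value has to be registered first. A minimal sketch, assuming a made-up column named "grade" (the sample data is purely illustrative):

```python
import pandas as pd

# Hypothetical sample data; the column name "grade" is invented for this sketch.
df = pd.DataFrame({"grade": pd.Categorical(["a", None, "b"])})

for col in df.select_dtypes(include="category").columns:
    # Register 'empty' as a valid category before filling,
    # otherwise fillna rejects the new value.
    if "empty" not in df[col].cat.categories:
        df[col] = df[col].cat.add_categories("empty")
    df[col] = df[col].fillna("empty")
```

The column keeps its category dtype afterwards, with 'empty' added to its list of categories.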

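2) To fill in the numerical variables with 0.0, you can do:

# Identify the columns that are numeric AND have at least one NaN to be filled.
# select_dtypes(include='number') skips object and category columns;
# .isnull().any() reduces the per-row mask to a single True/False per column.
num_vars = [x for x in df.select_dtypes(include='number').columns if df[x].isnull().any()]

# Loop over them, and fill them with 0.0 (assigning the result back to df)
for col in num_vars:
    df[col] = df[col].fillna(0.0)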
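If a single call is preferred over two loops, something close to the fillna_all idea from the question can be approximated by passing a {column: value} mapping to DataFrame.fillna. The sketch below is only an illustration under stated assumptions: the sample frame and its column names are made up, and category columns are deliberately left out because their fill value must first be added as a category.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; column names are invented for this sketch.
df = pd.DataFrame({
    "price": [1.5, np.nan, 3.0],
    "name": pd.Series(["x", None, "z"], dtype=object),
})

# Build one {column: fill value} mapping from the dtypes pandas reports.
fill_values = {}
for col in df.select_dtypes(include="number").columns:
    fill_values[col] = 0.0        # numeric columns (ints and floats)
for col in df.select_dtypes(include="object").columns:
    fill_values[col] = "empty"    # object (text) columns

# One fillna call handles every listed column.
df = df.fillna(fill_values)
```

Columns that do not appear in the mapping are left untouched, so the same pattern can be reused on frames with other dtypes mixed in.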

For the future, if you are interested in filling the numeric variables with mean or median:

for col in df[num_vars]:
    df[col] = df[col].fillna(df[col].median()) # or replace with mean() for mean     
sophocles