
Is there an easy way to drop columns that are predominately NaN?

I have a dataframe with 129 columns, and I would like to drop all columns with 50% or greater NaN values.

jottbe
  • Does this help you? https://stackoverflow.com/questions/31614804/how-to-delete-a-column-in-pandas-dataframe-based-on-a-condition – picmate 涅 Nov 26 '20 at 20:33

2 Answers


You could do it as follows:

import numpy as np
import pandas as pd

# calculate the ratio of NaNs per column
nan_ratio = df.isna().mean(axis='index')
# get the columns at or above a NaN threshold
# here 50% or more NaNs
cols_to_drop = nan_ratio.index[nan_ratio >= 0.5].to_list()
# remove those columns from the dataframe
df.drop(columns=cols_to_drop, inplace=True)
df

Test data:

df = pd.DataFrame(
    dict(
        c1=np.random.randint(1, 100, size=100),
        c2=np.random.randint(1, 30, size=100),
        c3=np.random.randint(1, 15, size=100),
        c4=np.random.randint(1, 16, size=100)
    ),
    dtype='float32'
)
df = df.where(df > 10, np.nan)

For the test data, the code above would usually delete columns c3 and c4 (depending on the random numbers).
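Because the test data above is random, the result varies from run to run. A minimal deterministic sketch of the same approach (using a small hypothetical frame, not the asker's 129-column data) makes the behavior easy to verify:

```python
import numpy as np
import pandas as pd

# hypothetical frame: 'a' has no NaNs, 'b' is 75% NaN, 'c' is exactly 50% NaN
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 4.0],
    'c': [1.0, np.nan, np.nan, 4.0],
})

# fraction of NaNs per column: a -> 0.0, b -> 0.75, c -> 0.5
nan_ratio = df.isna().mean(axis='index')
# columns at or above the 50% threshold get dropped
cols_to_drop = nan_ratio.index[nan_ratio >= 0.5].to_list()
df = df.drop(columns=cols_to_drop)
print(list(df.columns))  # ['a']
```

Note that `>= 0.5` drops column `c` even though it is exactly half NaN, matching the "50% or greater" requirement in the question.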

jottbe

Thanks for the help. I've found another way as well:

threshold = 0.25
df = df[df.columns[df.isnull().mean() < threshold]]
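As a variation on the approaches above, pandas' built-in `DataFrame.dropna` can do this directly via its `thresh` parameter, which is the minimum number of non-NaN values a column must have to be kept. A sketch (on a small hypothetical frame, not the asker's data):

```python
import numpy as np
import pandas as pd

# hypothetical frame: 'a' has no NaNs, 'b' is 75% NaN, 'c' is exactly 50% NaN
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 4.0],
    'c': [1.0, np.nan, np.nan, 4.0],
})

# thresh = minimum count of non-NaN values a column needs to survive;
# requiring strictly more than half the rows drops columns that are
# 50% or more NaN
thresh = len(df) // 2 + 1
df = df.dropna(axis=1, thresh=thresh)
print(list(df.columns))  # ['a']
```

This avoids computing the NaN ratio by hand, though note that `thresh` is an absolute count, not a fraction, so it must be derived from `len(df)`.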