Is there an easy way to drop columns that are predominately NaN?
I have a dataframe with 129 columns, and I would like to drop all columns with 50% or more NaN values.
You could do it as follows:
import numpy as np
import pandas as pd

# calculate the ratio of NaNs per column
nan_ratio = df.isna().mean(axis='index')

# get the columns at or above the threshold,
# here 50% or more NaNs
cols_to_drop = nan_ratio.index[nan_ratio >= 0.5].to_list()

# remove those columns from the dataframe
df.drop(columns=cols_to_drop, inplace=True)
df
Testdata:
df = pd.DataFrame(
    dict(
        c1=np.random.randint(1, 100, size=100),
        c2=np.random.randint(1, 30, size=100),
        c3=np.random.randint(1, 15, size=100),
        c4=np.random.randint(1, 16, size=100)
    ),
    dtype='float32'
)
df = df.where(df > 10, np.nan)
For this test data, the code above will usually delete columns c3 and c4 (depending on the random draws).
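As an aside, the same drop can be done in one step with pandas' built-in DataFrame.dropna and its thresh parameter, which specifies the minimum number of non-NaN values a column must have to be kept. A minimal sketch with a small hand-made frame (the column names and values are illustrative, not the asker's data):

```python
import numpy as np
import pandas as pd

# toy frame: c2 is exactly 50% NaN, c3 is mostly NaN
df = pd.DataFrame({
    'c1': [1.0, 2.0, 3.0, 4.0],
    'c2': [1.0, np.nan, 3.0, np.nan],
    'c3': [np.nan, np.nan, np.nan, 4.0],
})

# keep only columns with MORE than 50% non-NaN values,
# i.e. drop columns that are 50% or more NaN
thresh = len(df) // 2 + 1
df = df.dropna(axis='columns', thresh=thresh)
print(df.columns.tolist())  # → ['c1']
```

This avoids building the intermediate nan_ratio series, at the cost of expressing the threshold as a row count rather than a fraction.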
Thanks for the help. I've found another way as well (here with a 25% threshold rather than 50%):
threshold = 0.25
df = df[df.columns[df.isnull().mean() < threshold]]
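Note that this version keeps columns strictly below the threshold, so a column whose NaN ratio is exactly equal to the threshold gets dropped, unlike the >= variant above which targets the drop side. A small sketch of the boundary behaviour (toy data, names illustrative):

```python
import numpy as np
import pandas as pd

# 4-row frame where c2 is exactly 25% NaN
df = pd.DataFrame({
    'c1': [1.0, 2.0, 3.0, 4.0],
    'c2': [np.nan, 2.0, 3.0, 4.0],
})
threshold = 0.25

# strict '<' drops a column sitting exactly on the threshold
kept = df[df.columns[df.isnull().mean() < threshold]]
print(kept.columns.tolist())  # → ['c1']

# use '<=' instead if such boundary columns should survive
kept_inclusive = df[df.columns[df.isnull().mean() <= threshold]]
print(kept_inclusive.columns.tolist())  # → ['c1', 'c2']
```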