18

I have a DF with 200 columns, most of them full of NaN's. I would like to select all columns with no NaN's, or at least those with the fewest NaN's. I've tried dropping them with a threshold or with notnull(), but without success. Any ideas?

df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]

DF for example:

col1  col2 col3
23     45  NaN
54     39  NaN
NaN    45  76
87     32  NaN

The output should look like:

 df.dropna(axis=1, thresh=2)

    col1  col2
    23     45  
    54     39  
    NaN    45  
    87     32  
miradulo
Hristo Stoychev
  • No data, no use. Add sample data to illustrate your problem. – Bharath M Shetty Nov 21 '17 at 13:56
  • 1
    mate, you want me to add 200 columns to illustrate its NaN's? :) – Hristo Stoychev Nov 21 '17 at 14:00
  • You haven't seen this: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples. Adding sample data makes the problem easier to understand. – Bharath M Shetty Nov 21 '17 at 14:01
  • 1
    Possible duplicate of [How to drop column according to NAN percentage for dataframe?](https://stackoverflow.com/questions/43311555/how-to-drop-column-according-to-nan-percentage-for-dataframe) – miradulo Nov 21 '17 at 14:05
  • It is simple: use `pd.isnull` with `sum`, i.e. `df.loc[:, pd.isnull(df).sum() <= 2]`; adding sample data is always better :) And why are you not using `dropna`? It's giving you what you want, right? – Bharath M Shetty Nov 21 '17 at 14:13

8 Answers

31

You can keep only the columns that are not entirely NaN using

df = df[df.columns[~df.isnull().all()]]

Or

null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)

If you wish to remove columns based on a certain percentage of NaNs, say columns where more than 90% of the data is null:

cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
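
For example, on the question's sample frame (a sketch, using a 50% cutoff instead of 90% so that something actually gets dropped):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

# fraction of NaNs per column: col1 -> 0.25, col2 -> 0.00, col3 -> 0.75
cols_to_delete = df.columns[df.isnull().sum() / len(df) > .50]
df.drop(cols_to_delete, axis=1, inplace=True)  # only col3 is dropped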
Vaishali
13

df[df.columns[~df.isnull().any()]] will give you a DataFrame with only the columns that have no null values, and should be the solution.

df[df.columns[~df.isnull().all()]] only removes the columns that have nothing but null values and leaves columns with even one non-null value.

df.isnull() will return a dataframe of booleans with the same shape as df. These bools will be True if the particular value is null and False if it isn't.

df.isnull().any() will return True for all columns with even one null. This is where I'm diverging from the accepted answer, as df.isnull().all() will not flag columns that have even one non-null value!
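
A minimal sketch of the difference, assuming the question's sample frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

print(df.isnull().any())  # col1: True, col2: False, col3: True
print(df.isnull().all())  # all False, since every column has at least one value

df[df.columns[~df.isnull().any()]]  # keeps only col2
df[df.columns[~df.isnull().all()]]  # keeps all three columns here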

fogx
noname
2

I assume that you want to get all the columns without any NaN. If that's the case, you can first identify the columns without any NaN using ~col.isnull().any(), then use that to select your columns.

I can think of the following code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [23, 54, np.nan, 87],
    'col2': [45, 39, 45, 32],
    'col3': [np.nan, np.nan, 76, np.nan],
})

# This function checks whether the column has more NaNs than the threshold
def has_nan(col, threshold=0):
    return col.isnull().sum() > threshold

# Then you apply the "complement" of the function to get the columns
# with no NaN.

df.loc[:, ~df.apply(has_nan)]

# ... or pass the threshold as parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]
lmolina
2

Here is a simple function which you can use directly by passing a dataframe and a threshold:

df
'''
     pets   location     owner     id
0     cat  San_Diego     Champ  123.0
1     dog        NaN       Ron    NaN
2     cat        NaN     Brick    NaN
3  monkey        NaN     Champ    NaN
4  monkey        NaN  Veronica    NaN
5     dog        NaN      John    NaN
'''

def rmissingvaluecol(dff, threshold):
    # drop every column whose percentage of missing values is at or above the threshold
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    l = list(dff.columns[pct_missing < threshold])
    print("# Columns having more than %s percent missing values:" % threshold, (dff.shape[1] - len(l)))
    print("Columns:\n", sorted(set(dff.columns) - set(l)))
    return l


rmissingvaluecol(df, 1)  # here the threshold is 1%, i.e. we drop columns with 1% or more missing values

#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create a new dataframe excluding these columns:

l = rmissingvaluecol(df,1)
df1 = df[l]

PS: You can change the threshold as per your requirements.

Bonus step

You can find the percentage of missing values for each column (optional)

def missing(dff):
    print(round((dff.isnull().sum() * 100 / len(dff)), 2).sort_values(ascending=False))

missing(df)

#output
'''
id          83.33
location    83.33
owner        0.00
pets         0.00
dtype: float64
'''
Suhas_Pote
1

You should try df_notnull = df.dropna(how='all'). This will drop only the rows in which every value is null.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
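
Note that thresh in dropna is the minimum number of non-NA values required to keep a row or column, not the number of NaNs allowed. A small sketch assuming the question's sample frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

# col3 has only one non-NA value, so it fails thresh=2 and is dropped
df.dropna(axis=1, thresh=2)  # keeps col1 and col2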

Alex
  • Well you are right mate, with df_notnull = df.dropna(axis=1) I can select all columns which do not contain any NaN value, but when I add the thresh parameter it still shows all columns :( – Hristo Stoychev Nov 21 '17 at 14:05
0
null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values

df = df[full_col_series.index]
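
For instance, on the question's sample frame this keeps only col2, the one column with no missing values (a sketch assuming that frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

null_series = df.isnull().sum()               # col1 -> 1, col2 -> 0, col3 -> 3
full_col_series = null_series[null_series == 0]
print(df[full_col_series.index])              # only col2 survives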
MJP
0

This worked quite well for me and is probably tailored to your needs as well!

def nan_weed(df, thresh):
    # keep every column whose NaN count is at or below the threshold
    cols = []
    for col in df.columns:
        if df[col].isnull().sum() <= thresh:
            cols.append(col)
    return df[cols]
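
A quick usage sketch, assuming the question's sample frame and allowing at most one NaN per column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

print(nan_weed(df, 1))  # col3 has three NaNs, so only col1 and col2 remain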
0

I see a lot of answers on this thread about how to get rid of null values, which for my dataframes is never an option. We do not delete data. Ever.

I took the question as how to get just your null values to show, and in my case I had to find latitude and longitude and fill them in.

What I did was this for one column nulls:

df[df['Latitude'].isnull()]

or to explain it out

dataframe[dataframe['Column you want'].isnull()]

This pulled up my whole data frame and all the missing values of latitude.

What did not work is this (and at first I could not explain why): trying to do two columns at the same time:

df[df[['Latitude','Longitude']].isnull()]

That will give me all NaNs in the entire data frame: indexing with a boolean DataFrame masks individual cells rather than selecting rows, so the whole frame comes back with NaN wherever the mask is False.

So to do this all at once, what I added was the ID (in my case the ID for each row is the APN) with the two columns I needed at the end:

df[df['Latitude'].isnull()][['APN','Latitude','Longitude']]

By doing this little hack I was able to get every ID I needed to add data to, filtering across 600,000+ rows. Then I did it again for longitude just to be sure I did not miss anything.
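
To filter on both columns at once, the boolean frame can be collapsed to one flag per row with any(axis=1). A minimal sketch with a hypothetical frame matching the columns above:

import numpy as np
import pandas as pd

# hypothetical data with the APN/Latitude/Longitude columns mentioned above
df = pd.DataFrame({'APN': ['001', '002', '003'],
                   'Latitude': [32.7, np.nan, 33.1],
                   'Longitude': [-117.1, -117.2, np.nan]})

# rows where Latitude or Longitude (or both) is missing
print(df[df[['Latitude', 'Longitude']].isnull().any(axis=1)][['APN', 'Latitude', 'Longitude']])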

JQTs