29

I am using the pandas.DataFrame.dropna method to drop rows that contain NaN. This function returns a dataframe that excludes the dropped rows, as shown in the documentation.

How can I store a copy of the dropped rows as a separate dataframe? Is:

mydataframe[pd.isnull(['list', 'of', 'columns'])]

always guaranteed to return the same rows that dropna drops, assuming that dropna is called with subset=['list', 'of', 'columns'] ?

wesanyer
  • 982
  • 1
  • 6
  • 27

2 Answers2

41

You can do this by indexing the original DataFrame by using the unary ~ (invert) operator to give the inverse of the NA free DataFrame.

na_free = df.dropna()
only_na = df[~df.index.isin(na_free.index)]

Another option would be to use the ufunc implementation of ~.

only_na = df[np.invert(df.index.isin(na_free.index))]
tsherwen
  • 1,076
  • 16
  • 21
anmol
  • 751
  • 6
  • 7
  • I really like this solution to the question. Note that the `-` has been deprecated in favor of `~` though. – johnchase Dec 15 '15 at 20:47
  • This is very close to what I want to achieve, but it doesn't quite work for me the way I'd like it to becauyse it pulls out the `na`/`nan` values, regardless of whether there are other values present in the row. I want to extract only those rows where every value is `na`/`nan` - how can this be achieved? – Mus Apr 19 '18 at 08:34
  • You can use `df.dropna(how='all')` to achieve this. see the panda docs for more info https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html – anmol Apr 19 '18 at 16:42
5

Start with the following data frame:

import pandas as pd
import numpy as np
df = pd.DataFrame([['a', 'b', np.nan], [np.nan, 'c', 'c'], ['c', 'd', 'a']],
              columns=['col1', 'col2', 'col3'])
df
  col1 col2 col3
0    a    b  NaN
1  NaN    c    c
2    c    d    a

And say we want to keep rows with Nans in the columns col2 and col3 One way to do this is the following: which is based on the answers from this post

df.loc[pd.isnull(df[['col2', 'col3']]).any(axis=1)]

  col1 col2 col3
0    a    b  NaN

So this gives us the rows that would be dropped if we dropped rows with Nans in the columns of interest. To keep the columns we can run the same code, but use a ~ to invert the selection

df.loc[~pd.isnull(df[['col2', 'col3']]).any(axis=1)]

  col1 col2 col3
1  NaN    c    c
2    c    d    a

this is equivalent to:

df.dropna(subset=['col2', 'col3'])

Which we can test:

df.dropna(subset=['col2', 'col3']).equals(df.loc[~pd.isnull(df[['col2', 'col3']]).any(axis=1)])

True

You can of course test this on your own larger dataframes but should get the same answer.

paradocslover
  • 2,932
  • 3
  • 18
  • 44
johnchase
  • 13,155
  • 6
  • 38
  • 64