How to drop rows of Pandas DataFrame whose value in a certain column is NaN

Question

I have this DataFrame and want only the records whose EPS column is not NaN:

>>> df
                 STK_ID  EPS  cash
STK_ID RPT_Date                   
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

...i.e. something like df.drop(....) to get this resulting dataframe:

                  STK_ID  EPS  cash
STK_ID RPT_Date                   
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

How do I do that?

dropna: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html — Wouter Overmeire, Nov 16 '12 at 09:29
`df.dropna(subset = ['column1_name', 'column2_name', 'column3_name'])` — Sergey Orshanskiy, Sep 05 '14 at 23:53
Another ruthless way if you hate NaN so much `df = df.dropna(subset=df.columns.values)` and you find there are no NaN anywhere — dejjub-AIS, Oct 01 '22 at 18:55

score 1556 · Accepted Answer · edited Feb 16 '20 at 07:46

Don't drop, just take the rows where EPS is not NA:

df = df[df['EPS'].notna()]
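
For anyone who wants to try this, here is a minimal sketch of the boolean-mask approach. The frame is a hand-built approximation of the question's data (it assumes a default integer index rather than the STK_ID/RPT_Date MultiIndex), and the names `mask`, `filtered` and `same` are only illustrative:

```python
import numpy as np
import pandas as pd

# Approximation of the question's data (default integer index, assumed for brevity)
df = pd.DataFrame({
    "STK_ID": ["601166", "600036", "600016", "601009", "601939", "000001"],
    "EPS": [np.nan, np.nan, 4.3, np.nan, 2.5, np.nan],
    "cash": [np.nan, 12, np.nan, np.nan, np.nan, np.nan],
})

mask = df["EPS"].notna()   # True where EPS holds a value
filtered = df[mask]        # keep only those rows

# dropna restricted to the EPS column selects the same rows
same = df.dropna(subset=["EPS"])
print(filtered.equals(same))
```

Both forms leave the `cash` NaNs alone, which matches the output requested in the question.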

edited Feb 16 '20 at 07:46 by AMC
answered Nov 16 '12 at 09:34 by eumiro

Is there any advantage to indexing and copying over dropping? – Robert Muil Jul 31 '15 at 08:15
@wes-mckinney could you please let me know whether dropna() is a better choice than pandas.notnull in this case? If so, why? – stormfield Sep 07 '17 at 11:53
This does not catch line 3 where EPS is 4.3 (valid) and cash is NaN. I expect OP to want to drop that one too. – Cadoiz Jun 08 '20 at 07:52
We can also use `df.dropna(subset=['EPS'])` – Mohith7548 Jan 22 '21 at 06:47
`dropna` is actually faster if there are multiple columns. – Ka Wa Yip Dec 29 '21 at 08:57

score 1201 · Answer 2 · edited Aug 14 '17 at 00:04

This question is already resolved, but...

...also consider the solution suggested by Wouter in his original comment. The ability to handle missing data, including dropna(), is built into pandas explicitly. Aside from potentially improved performance over doing it manually, these functions also come with a variety of options which may be useful.

In [24]: df = pd.DataFrame(np.random.randn(10,3))

In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [27]: df.dropna()     #drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295

In [28]: df.dropna(how='all')     #drop only if ALL columns are NaN
Out[28]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [29]: df.dropna(thresh=2)   #Drop row if it does not have at least two values that are **not** NaN
Out[29]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

In [30]: df.dropna(subset=[1])   #Drop only if NaN in specific column (as asked in the question)
Out[30]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

There are also other options (See docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), including dropping columns instead of rows.

Pretty handy!

you can also use `df.dropna(subset = ['column_name'])`. Hope that saves at least one person the extra 5 seconds of 'what am I doing wrong'. Great answer, +1 — James Tobin, Jun 18 '14 at 14:07
@JamesTobin, I just spent 20 minutes to write a function for that! [The official documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) was very cryptic: "Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include". I was unable to understand, what they meant... — Sergey Orshanskiy, Sep 05 '14 at 23:52
`df.dropna(subset = ['column_name'])` is exactly what I was looking for! Thanks! — amalik2205, Dec 08 '19 at 21:09
This answer is super helpful but in case it isn't clear to anyone reading what options are useful in which situations, I've put together a dropna FAQ post [here](https://stackoverflow.com/a/62444845/4909087). Hope this helps people who are struggling to apply `dropna` to their specific need. — cs95, Jun 18 '20 at 21:19
+1 this answer also seems to help avoid having `SettingWithCopyWarning` later when you use `df.dropna(subset = ['column_name'], inplace=True)` — cookiemonster, Jul 02 '21 at 17:45
@Aman Hi, could you take a look at this question https://stackoverflow.com/questions/70954791/identifying-statistical-outliers-with-pandas-groupby-and-reduce-rows-into-diffe — Aaditya Ura, Feb 02 '22 at 11:29

Joe · Answer 3 · 2021-05-10T17:14:42.857

154

You can use this:

df.dropna(subset=['EPS'], how='all', inplace=True)

edited May 10 '21 at 17:14
answered Aug 02 '17 at 16:28 by Joe

`how='all'` is redundant here, because you are subsetting the DataFrame with only one field, so both `'all'` and `'any'` will have the same effect. – Anton Protopopov Jan 16 '18 at 12:41
@AntonProtopopov **IMPORTANT:** `how='all'` is NOT redundant. Define a simple dataframe: `df = pd.DataFrame({"a": [10, None], "b": [None, 10]})`. Doing `df.dropna(subset=['a', 'b'], how='all')` leaves the dataframe intact (as there aren't rows where both columns are `NaN`), while dropping that parameter returns an empty dataframe. – Enrique Ortiz Casillas Oct 20 '22 at 21:59
@EnriqueOrtizCasillas we were talking about that specific case. In the comment I mentioned that it's only about **one** field. For that `'all'` and `'any'` are the same. In general case it depends on what is your ultimate goal. In your example you are selecting by two columns - that's a different case. – Anton Protopopov Nov 24 '22 at 19:09
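
To make the two cases in this exchange concrete, here is a small sketch using the DataFrame from the comment above. The variable names are only illustrative:

```python
import pandas as pd

# DataFrame from the comment above: each row has exactly one missing value
df = pd.DataFrame({"a": [10, None], "b": [None, 10]})

# One column in subset: 'any' and 'all' apply the same test
one_any = df.dropna(subset=["a"], how="any")
one_all = df.dropna(subset=["a"], how="all")
print(one_any.equals(one_all))

# Two columns in subset: the results diverge
two_any = df.dropna(subset=["a", "b"], how="any")  # every row has a NaN, so nothing is left
two_all = df.dropna(subset=["a", "b"], how="all")  # no row is all-NaN, so nothing is dropped
print(len(two_any), len(two_all))
```

So for the single-column case in the answer, `how='all'` changes nothing; it only matters once `subset` lists more than one column.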

score 150 · Answer 4 · answered Apr 23 '14 at 05:37

I know this has already been answered, but just for the sake of a purely pandas solution to this specific question as opposed to the general description from Aman (which was wonderful) and in case anyone else happens upon this:

import pandas as pd
df = df[pd.notnull(df['EPS'])]

answered Apr 23 '14 at 05:37 by Kirk Hadley

Actually, the specific answer would be: `df.dropna(subset=['EPS'])` (based on the general description of Aman, of course this does also work) – joris Apr 23 '14 at 12:53
`notnull` is also what Wes (author of Pandas) suggested in his comment on another answer. – fantabolous Jul 09 '14 at 03:24
This may be a noob question, but when I use `df[pd.notnull(...)]` or `df.dropna`, the index is left with gaps. So if there was a null value at row index 10 in a df of length 200, the DataFrame after the drop has index values from 1 to 9 and then 11 to 200. Is there any way to "re-index" it? – Aakash Gupta Mar 04 '16 at 06:03
You could also do `df[pd.notnull(df[df.columns[INDEX]])]`, where `INDEX` is the column's position, if you don't know its name – ocean800 Oct 31 '19 at 20:30
For some reason this answer worked for me and `df.dropna(subset=['column name'])` didn't. – Dr. Mian Jun 24 '20 at 12:42
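
On the re-indexing question raised in the comments: one common approach is `reset_index(drop=True)`. A minimal sketch on a made-up four-row frame (the data and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"EPS": [1.0, np.nan, 3.0, 4.0]})

kept = df[pd.notnull(df["EPS"])]
print(kept.index.tolist())       # original labels survive, with a gap

# drop=True discards the old labels instead of adding them as a column
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())
```

Whether to renumber depends on whether the original labels carry meaning; for the question's MultiIndex of STK_ID and RPT_Date they do, so resetting would probably not be wanted there.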

cs95 · Answer 5 · 2020-07-03T22:07:23.917

This is an old question which has been beaten to death but I do believe there is some more useful information to be surfaced on this thread. Read on if you're looking for the answer to any of the following questions:

Can I drop rows if any of its values have NaNs? What about if all of them are NaN?
Can I only look at NaNs in specific columns when dropping rows?
Can I drop rows with a specific count of NaN values?
How do I drop columns instead of rows?
I tried all of the options above but my DataFrame just won't update!

`DataFrame.dropna`: Usage, and Examples

It's already been said that df.dropna is the canonical method to drop NaNs from DataFrames, but there's nothing like a few visual cues to help along the way.

# Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, 2, 3, 4],
    'B': [np.nan, np.nan, 2, 3],
    'C': [np.nan]*3 + [3]})

df                      
     A    B    C
0  NaN  NaN  NaN
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

Below is a breakdown of the most important arguments and how they work, arranged in an FAQ format.

Can I drop rows if any of its values have NaNs? What about if all of them are NaN?

This is where the how=... argument comes in handy. It can be one of

'any' (default) - drops rows if at least one column has NaN
'all' - drops rows only if all of its columns have NaNs

# Removes all but the last row, the only one with no NaNs
df.dropna()

     A    B    C
3  4.0  3.0  3.0

# Removes the first row only
df.dropna(how='all')

     A    B    C
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

Note
If you just want to see which rows are null (IOW, if you want a boolean mask of rows), use isna:
df.isna()

       A      B      C
0   True   True   True
1  False   True   True
2  False  False   True
3  False  False  False

df.isna().any(axis=1)

0     True
1     True
2     True
3    False
dtype: bool
To get the inversion of this result, use notna instead.
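
As a rough illustration of how these masks relate to `dropna`, the sketch below rebuilds the setup frame and filters it by hand. The names `complete` and `not_empty` are just illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

complete = df.notna().all(axis=1)    # True for rows with no NaN at all
not_empty = df.notna().any(axis=1)   # True for rows with at least one value

# Filtering with the masks reproduces the two `how` settings
print(df[complete].equals(df.dropna()))
print(df[not_empty].equals(df.dropna(how="all")))
```

Building the mask explicitly is mostly useful when you want to inspect or count the affected rows before removing them.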

Can I only look at NaNs in specific columns when dropping rows?

This is a use case for the subset=[...] argument.

Specify a list of columns (or index labels with axis=1) to tell pandas you only want to look at these columns (or rows with axis=1) when dropping rows (or columns with axis=1).

# Drop all rows with NaNs in A
df.dropna(subset=['A'])

     A    B    C
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

# Drop all rows with NaNs in A OR B
df.dropna(subset=['A', 'B'])

     A    B    C
2  3.0  2.0  NaN
3  4.0  3.0  3.0

Can I drop rows with a specific count of NaN values?

This is a use case for the thresh=... argument. Specify the minimum number of NON-NULL values as an integer.

df.dropna(thresh=1)  

     A    B    C
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

df.dropna(thresh=2)

     A    B    C
2  3.0  2.0  NaN
3  4.0  3.0  3.0

df.dropna(thresh=3)

     A    B    C
3  4.0  3.0  3.0

The thing to note here is you need to specify how many NON-NULL values you want to keep, rather than how many NULL values you want to drop. This is a pain point for new users.

Luckily the fix is easy: if you have a count of NULL values, subtract it from the number of columns and add 1 to get the correct thresh argument for the function.

required_min_null_values_to_drop = 2 # drop rows with at least 2 NaN
df.dropna(thresh=df.shape[1] - required_min_null_values_to_drop + 1)

     A    B    C
2  3.0  2.0  NaN
3  4.0  3.0  3.0
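
If you do this often, the conversion can be wrapped in a small function. The helper below is hypothetical (it is not part of pandas), and the cross-check against an explicit per-row NaN count is only there to show the arithmetic holds on the setup frame:

```python
import numpy as np
import pandas as pd

def drop_rows_with_min_nans(frame: pd.DataFrame, min_nans: int) -> pd.DataFrame:
    """Drop rows containing at least `min_nans` NaN values (hypothetical helper)."""
    # A row survives when it has at most min_nans - 1 NaNs,
    # i.e. at least n_columns - min_nans + 1 non-null values.
    return frame.dropna(thresh=frame.shape[1] - min_nans + 1)

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

result = drop_rows_with_min_nans(df, 2)

# Cross-check: keep rows whose NaN count is below the cutoff
expected = df[df.isna().sum(axis=1) < 2]
print(result.equals(expected))
```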

How do I drop columns instead of rows?

Use the axis=... argument, which can be axis=0 or axis=1.

It tells the function whether you want to drop rows (axis=0) or drop columns (axis=1).

df.dropna()

     A    B    C
3  4.0  3.0  3.0

# Every column contains at least one NaN, so the result is empty.
df.dropna(axis=1)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

# Here's a different example requiring the column to have all NaN rows
# to be dropped. In this case no columns satisfy the condition.
df.dropna(axis=1, how='all')

     A    B    C
0  NaN  NaN  NaN
1  2.0  NaN  NaN
2  3.0  2.0  NaN
3  4.0  3.0  3.0

# Here's a different example requiring a column to have at least 2 NON-NULL
# values. Column C has less than 2 NON-NULL values, so it should be dropped.
df.dropna(axis=1, thresh=2)

     A    B
0  NaN  NaN
1  2.0  NaN
2  3.0  2.0
3  4.0  3.0

I tried all of the options above but my DataFrame just won't update!

dropna, like most other functions in the pandas API, returns a new DataFrame (a copy of the original with changes) as the result, so you should assign it back if you want to see changes.

df.dropna(...) # wrong
df.dropna(..., inplace=True) # right, but not recommended
df = df.dropna(...) # right
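
A quick way to convince yourself of this is to check the row count before and after. The three-row frame here is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5]})

df.dropna(subset=["EPS"])        # result is discarded; df is unchanged
rows_without_assignment = len(df)

df = df.dropna(subset=["EPS"])   # rebinding df keeps the filtered result
rows_with_assignment = len(df)

print(rows_without_assignment, rows_with_assignment)
```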

Reference

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

DataFrame.dropna(
    self, axis=0, how='any', thresh=None, subset=None, inplace=False)

score 43 · Answer 6 · edited Aug 08 '18 at 15:17

Simplest of all solutions:

filtered_df = df[df['EPS'].notnull()]

This is preferable to np.isfinite(), which also filters out infinite values and raises a TypeError on non-numeric (object) columns.
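
For what it's worth, here is a small sketch of where `notnull()` and `np.isfinite()` differ. The two Series are made-up examples, and the object-dtype case is set explicitly so the behaviour does not depend on pandas' default string handling:

```python
import numpy as np
import pandas as pd

numeric = pd.Series([4.3, np.inf, np.nan])
print(numeric.notnull().tolist())      # inf counts as a present value
print(np.isfinite(numeric).tolist())   # inf counts as not finite

text = pd.Series(["a", None], dtype=object)
print(text.notnull().tolist())         # works on object columns

# np.isfinite is not defined for object arrays
try:
    np.isfinite(text)
    isfinite_failed = False
except TypeError:
    isfinite_failed = True
print(isfinite_failed)
```

So if infinities should also be removed from a numeric column, `np.isfinite` may actually be what you want; for plain missing-value filtering `notnull()` is the safer default.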

edited Aug 08 '18 at 15:17 by ayhan
answered Nov 23 '17 at 12:08 by Gil Baggio

score 31 · Answer 7 · edited Jan 23 '19 at 10:13

Simple and easy way

df.dropna(subset=['EPS'],inplace=True)

source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

answered Jan 22 '19 at 08:26 by Noordeen

`inplace=True` is a bizarre topic, and has no effect on `DataFrame.dropna()`. See: https://github.com/pandas-dev/pandas/issues/16529 – AMC Feb 16 '20 at 03:56
How does this answer differ from @Joe's answer? Also, inplace will be deprecated eventually, best not to use it at all. – misantroop Mar 28 '20 at 07:28

score 26 · Answer 8 · answered Dec 04 '15 at 07:01

You could use the DataFrame method notnull, the inverse of isnull, or numpy.isnan:

In [332]: df[df.EPS.notnull()]
Out[332]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [334]: df[~df.EPS.isnull()]
Out[334]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [347]: df[~np.isnan(df.EPS)]
Out[347]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN

score 15 · Answer 9 · answered Apr 20 '17 at 21:15

yet another solution which uses the fact that np.nan != np.nan:

In [149]: df.query("EPS == EPS")
Out[149]:
                 STK_ID  EPS  cash
STK_ID RPT_Date
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN
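
A short sketch of why this works, on a made-up three-row frame: a float NaN never compares equal to itself, so `EPS == EPS` is False exactly where EPS is NaN. This assumes an ordinary float column; nullable dtypes that use `pd.NA` may behave differently.

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # NaN never compares equal to itself

df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5]})

self_equal = df["EPS"] == df["EPS"]   # False exactly where EPS is NaN
print(self_equal.equals(df["EPS"].notna()))

via_query = df.query("EPS == EPS")
print(via_query.equals(df[df["EPS"].notna()]))
```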

answered Apr 20 '17 at 21:15 by MaxU - stand with Ukraine

score 4 · Answer 10 · edited Feb 10 '20 at 09:19

Another version:

df[~df['EPS'].isna()]

edited Feb 10 '20 at 09:19 by Georgy
answered Feb 08 '20 at 07:59 by keramat

Why use this over `Series.notna()` ? – AMC Feb 16 '20 at 03:58

score 3 · Answer 11 · edited Jan 26 '17 at 23:12

It may be added that '&' can be used to combine additional conditions, e.g.

df = df[(df.EPS > 2.0) & (df.EPS < 4.0)]

Note that pandas needs parentheses around each condition when evaluating such statements.
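
To tie this back to the NaN question, here is a small sketch on a made-up frame. It shows that comparisons against NaN evaluate to False, so a range filter like the one above already excludes the NaN rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5, 1.0]})

# & binds tighter than > and <, so each comparison needs its own parentheses
in_range = df[(df["EPS"] > 2.0) & (df["EPS"] < 4.0)]
print(in_range["EPS"].tolist())

# Comparisons with NaN are False, so NaN rows are already excluded;
# adding notna() only makes that intent explicit.
explicit = df[df["EPS"].notna() & (df["EPS"] > 2.0) & (df["EPS"] < 4.0)]
print(in_range.equals(explicit))
```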

edited Jan 26 '17 at 23:12 by aesede
answered Mar 15 '16 at 15:33 by David

Sorry, but the OP wants something else. Btw, your code is wrong and returns `ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().`. You need to add parentheses: `df = df[(df.EPS > 2.0) & (df.EPS <4.0)]`, but it is still not an answer to this question. – jezrael Mar 16 '16 at 11:52

score 3 · Answer 12 · answered Dec 08 '21 at 06:17

The following method worked for me. It may help if none of the above methods work:

df[df['column_name'].str.len() >= 1]

The basic idea is that you keep a record only if its string length is at least 1. This is especially useful if you are dealing with string data.
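
A small sketch of what this filter does on a string column. The column name `name` and its contents are made up, and the object dtype is set explicitly as an assumption. Note that it removes empty strings as well as NaN, which `notna()` alone would not:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": pd.Series(["abc", "", np.nan, "x"], dtype=object)})

lengths = df["name"].str.len()   # NaN stays NaN, "" has length 0
kept = df[lengths >= 1]          # NaN >= 1 is False, so NaN rows are dropped too
print(kept["name"].tolist())

# notna() alone keeps the empty string
print(df[df["name"].notna()]["name"].tolist())
```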

Best!

answered Dec 08 '21 at 06:17 by Taie

This only works for object (string) columns. If your column is float or int, you get: AttributeError: Can only use .str accessor with string values! – rubengavidia0x Feb 08 '22 at 00:00

Pradeep Singh · Answer 13 · 2020-02-17T11:00:16.597

In datasets with a large number of columns, it helps to first see how many columns contain null values and how many don't.

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

For example in my dataframe it contained 82 columns, of which 19 contained at least one null value.

You can also remove columns and rows automatically based on how many null values they contain. The following drops every column whose NaN count exceeds the number of columns in the frame, then drops any remaining row that contains a NaN:

df = df.drop(df.columns[df.isna().sum() > len(df.columns)], axis=1)
df = df.dropna(axis=0).reset_index(drop=True)

Note: the code above removes all null values. If you need to keep some of them, handle them before running it.
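
As a variation on the same idea, here is a sketch that uses the share of NaN per column as the cutoff rather than a raw count. The frame and the 0.5 cutoff (`max_nan_share`) are assumptions chosen for illustration, not something from the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

nan_share = df.isna().mean()   # fraction of NaN per column
print(nan_share.to_dict())

max_nan_share = 0.5            # assumed cutoff, tune per dataset
trimmed = df.loc[:, nan_share <= max_nan_share]     # drop mostly-empty columns
cleaned = trimmed.dropna().reset_index(drop=True)   # then drop incomplete rows
print(list(cleaned.columns), len(cleaned))
```

Dropping the mostly-empty columns first means fewer rows are lost in the second step than with a plain `df.dropna()`.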

There is Another Question [link](https://stackoverflow.com/q/36226083/8127390) — Pradeep Singh, Dec 14 '19 at 04:40
This question has really been squeezed out of questioning, get it? :) — Moaaz Siddiqui, Sep 28 '21 at 12:46

score 2 · Answer 14 · answered Jul 02 '22 at 06:07

You can also use notna inside query:

In [4]: df.query('EPS.notna().values')
Out[4]: 
                 STK_ID.1  EPS  cash
STK_ID RPT_Date                     
600016 20111231    600016  4.3   NaN
601939 20111231    601939  2.5   NaN

answered Jul 02 '22 at 06:07 by rachwa

score -4 · Answer 15 · answered Feb 21 '22 at 19:55

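You can try:

df['EPS'].dropna()

Note that this returns only the EPS column as a Series with the NaN entries removed, not the filtered DataFrame the question asks for.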

answered Feb 21 '22 at 19:55 by Simon
