60

I have the following dataframe

           time       X    Y  X_t0     X_tp0  X_t1     X_tp1  X_t2     X_tp2
0         0.002876    0   10     0       NaN   NaN       NaN   NaN       NaN
1         0.002986    0   10     0       NaN     0       NaN   NaN       NaN
2         0.037367    1   10     1  1.000000     0       NaN     0       NaN
3         0.037374    2   10     2  0.500000     1  1.000000     0       NaN
4         0.037389    3   10     3  0.333333     2  0.500000     1  1.000000
5         0.037393    4   10     4  0.250000     3  0.333333     2  0.500000

....
1030308   9.962213  256  268   256  0.000000   256  0.003906   255  0.003922
1030309  10.041799    0  268     0      -inf   256  0.000000   256  0.003906
1030310  10.118960    0  268     0       NaN     0      -inf   256  0.000000

I tried with the following

df.dropna(inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

X_train = X_train.drop('time', axis=1)
X_train = X_train.drop('X_t1', axis=1)
X_train = X_train.drop('X_t2', axis=1)
X_test = X_test.drop('time', axis=1)
X_test = X_test.drop('X_t1', axis=1)
X_test = X_test.drop('X_t2', axis=1)
X_test.fillna(X_test.mean(), inplace=True)
X_train.fillna(X_train.mean(), inplace=True)
y_train.fillna(y_train.mean(), inplace=True)

However, I am still getting the error `ValueError: Input contains NaN, infinity or a value too large for dtype('float32').` whenever I try to fit a regression model with `fit(X_train, y_train)`.

How can we remove both the NaN and -inf values at the same time?

  • Do you want to remove the rows with `NaN` and `-inf` or set them to default values? – AChampion Aug 17 '17 at 21:19
  • I want to remove (or drop) them –  Aug 17 '17 at 21:20
  • 6
    Replace `-inf` with `NaN` (`df.replace(-np.inf, np.nan)`) then do the `dropna()`. – AChampion Aug 17 '17 at 21:22
  • Thank you but i still am getting the same error when i try to fit a regression model `fit(X_train, y_train)` –  Aug 17 '17 at 21:31
  • @AChampion, when i do `df.replace(-np.inf, np.nan)`, it converts the `-inf` values to `NaN`. However, when we do `df.dropna(inplace=True)` - it doesn't remove ALL `NaN` values - it leaves some rows with `NaN` values out and that's why i am still getting the same error. Is it possible to force to remove ALL rows with `NaN` values? –  Aug 17 '17 at 22:14
  • 2
    Possible duplicate of [dropping infinite values from dataframes in pandas?](https://stackoverflow.com/questions/17477979/dropping-infinite-values-from-dataframes-in-pandas) – DJK Aug 17 '17 at 22:16
  • @djk47463, it is NOT a duplicate please. I have seen that before I post my question. If you can't give a helping hand, don't abuse the power of stackoverflow please :( –  Aug 17 '17 at 22:19
  • @user2154227 That's not abuse of SO, you asked the exact same question, you notice it also says "Possible duplicate" meaning that has to be reviewed – DJK Aug 17 '17 at 22:23

6 Answers

91

Use `pd.DataFrame.isin` to flag the offending cells, reduce to rows containing any of them with `pd.DataFrame.any`, then use the inverted boolean mask to slice the dataframe.

df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

             time    X    Y  X_t0     X_tp0   X_t1     X_tp1   X_t2     X_tp2
4        0.037389    3   10     3  0.333333    2.0  0.500000    1.0  1.000000
5        0.037393    4   10     4  0.250000    3.0  0.333333    2.0  0.500000
1030308  9.962213  256  268   256  0.000000  256.0  0.003906  255.0  0.003922
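The same pattern can be sketched on a small, hypothetical frame standing in for the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame.
df = pd.DataFrame({
    "X_tp0": [np.nan, 1.0, 0.5, -np.inf],
    "X_tp1": [np.nan, np.nan, 1.0, 0.0],
})

# Flag cells that are NaN or +/-inf, reduce per row with any(),
# then invert the mask to keep only the clean rows.
bad_rows = df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
clean = df[~bad_rows]
print(clean)  # only row 2 (0.5, 1.0) survives
```

Note that `isin` matches `NaN` here even though `NaN != NaN`, because pandas special-cases missing values in the lookup.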
piRSquared
34

You can replace inf and -inf with NaN, and then select non-null rows.

df[df.replace([np.inf, -np.inf], np.nan).notnull().all(axis=1)]  # .astype(np.float64) ?

or

df.replace([np.inf, -np.inf], np.nan).dropna()

Check the dtypes of the columns with `df.info()` to make sure they are all as expected (e.g. `np.float32`/`np.float64`).
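A minimal end-to-end sketch of this approach (hypothetical values mirroring the question's NaN/-inf mix):

```python
import numpy as np
import pandas as pd

# Toy frame with both NaN and -inf, as in the question.
df = pd.DataFrame({
    "time":  [0.1, 0.2, 0.3, 0.4],
    "X_tp0": [np.nan, 1.0, -np.inf, 0.25],
    "Y":     [10, 10, 268, 268],
})

# replace() returns a new frame; assign it back (or pass inplace=True),
# otherwise the -inf cells survive and dropna() never sees them.
df = df.replace([np.inf, -np.inf], np.nan).dropna()
print(df.shape)  # (2, 3)
```

Assigning the result back is the step the question's code is missing: a bare `df.replace(...)` leaves `df` untouched, which is why the later `dropna` still leaves infinities behind.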

Alexander
  • I am still getting the same error message. When i do `df.info()` here is the output`Data columns (total 9 columns): time 1030291 non-null float64 X 1030291 non-null int64 Y 1030291 non-null int64 X_t0 1030291 non-null int64 X_tp0 1030291 non-null float64 X_t1 1030291 non-null float64 X_tp1 1030291 non-null float64 X_t2 1030291 non-null float64 X_tp2 1030291 non-null float64 dtypes: float64(6), int64(3) memory usage: 78.6 MB` –  Aug 17 '17 at 21:53
  • Some of those columns are integers. I'm not sure if it would help, but try converting everything to floats via `.astype(np.float64)`. Failing that, try `df.describe()` to check for max or min values that look out of line. – Alexander Aug 17 '17 at 21:55
  • OK, I will Alexander. when i do `df.replace(-np.inf, np.nan)`, it converts the `-inf` values to `NaN`. However, when we do `df.dropna(inplace=True)` - it doesn't remove ALL `NaN` values - it leaves some rows with `NaN` values out and that's why i am still getting the same error. Is it possible to force to remove ALL rows with `NaN` values? –  Aug 17 '17 at 22:17
  • You need to leave the axis at its default of `0` to drop rows; `axis=1` drops columns: `df.dropna(axis=0)`. Also, see this: https://stackoverflow.com/questions/17477979/dropping-infinite-values-from-dataframes-in-pandas/35783766#35783766 – Alexander Aug 17 '17 at 22:20
21
# replace() is not in-place by default; pass inplace=True (or assign the result)
df.replace([np.inf, -np.inf], np.nan, inplace=True)

df.dropna(inplace=True)
Zoe
19

Instead of dropping rows which contain any nulls or infinite numbers, it is more succinct to reverse that logic and return only the rows where all cells are finite numbers. The numpy `isfinite` function does this, and `.all(axis=1)` returns `True` only if every cell in the row is finite.

df = df[np.isfinite(df).all(axis=1)]

Edit: If you have some non-numerical dtypes in your dataframe, you might want to isolate the float dtype columns of interest. See example below.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3,4), columns=list("ABCD"))
df['E'] = 'a_string'
df.at[2,'D'] = np.nan
df.at[1,'B'] = np.inf
df['A'] = df['A'].astype(np.float16)
df
          A         B         C         D         E
0  0.325195  0.199801  0.175851  0.989883  a_string
1  0.040192       inf  0.296379  0.632059  a_string
2  0.348877  0.369374  0.976187       NaN  a_string

floating_columns = df.select_dtypes(include=[np.floating]).columns
subset_df = df[floating_columns]
df = df[np.isfinite(subset_df).all(axis=1)]
df
          A         B         C         D         E
0  0.381104  0.119991  0.388697  0.235735  a_string

DougR
  • Unfortunately this doesn't work for me. The error I get is `TypeError: Object with dtype category cannot perform the numpy op isfinite` – Antti Mar 10 '22 at 13:37
  • 1
    @Antti I've edited the answer to include your use-case. But with mixed dtypes, the top answer would probably be your best bet. – DougR Mar 10 '22 at 20:42
4

I prefer to set the option so that inf values are treated as NaN:

s1 = pd.Series([0, 1, 2])
s2 = pd.Series([2, 1, 0])
s1/s2
# Outputs:
# 0.0
# 1.0
# inf
# dtype: float64

pd.set_option('mode.use_inf_as_na', True)
s1/s2
# Outputs:
# 0.0
# 1.0
# NaN
# dtype: float64

Note that you can also use a context manager:

with pd.option_context('mode.use_inf_as_na', True):
    print(s1/s2)
# Outputs:
# 0.0
# 1.0
# NaN
# dtype: float64
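With the option active, `dropna` (and `isna`) treat infinities as missing, so a single drop removes them, as in this small sketch. (Be aware this option was deprecated in later pandas releases, so the explicit `replace`-then-`dropna` approach is more future-proof.)

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.inf])

# With the option active, inf counts as missing, so dropna() removes it.
with pd.option_context('mode.use_inf_as_na', True):
    cleaned = s.dropna()
print(list(cleaned))  # [0.0, 1.0]
```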
mrkbutty
  • I tried all the mentioned solutions here. But i still have nan and inf values in my dataframe. Can anyone help? Also my values in the dataframe are exponential float values how can i convert that to small float values? –  Sep 26 '20 at 08:16
0

`df.replace` only replaces the first occurrence of the value, and thus the error.

`df = list(filter(lambda x: x != inf, df))` would remove all occurrences of `inf`, and then the drop function can be used.

Michel_T.