sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Question

I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I have run

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

I tried using

mat[np.isfinite(mat) == True] = 0

to remove the infinite values but this did not work either. What can I do to get rid of the infinite values in my matrix, so that I can use the affinity propagation algorithm?

I am using anaconda and python 2.7.9.

I'm voting to close this, as the author says himself that his data was invalid and though everything pointed to it, he didn't validate -- the data equivalent to a typo, which is a closing reason. — Marcus Müller, Sep 06 '15 at 18:55
I had this same issue with my dataset. Ultimately: a data mistake, not a scikit learn bug. Most of the answers below are helpful but misleading. Check check check your data, make sure that when converted to `float64` it is both finite and not `nan`. The error message is apt - this is almost certainly the issue for anyone who finds themselves here. — Owen, Dec 07 '16 at 13:52
For the record and +1 for @Owen, check your input data and make sure you do not have any missing value in any row or grid. You can use the Imputer class to avoid this problem. — abautista, Jun 20 '18 at 21:29
i have that problem with kaggle's kc_house_data.csv dataset. I am trying to do a linear regression using the variables : ['bedrooms','bathrooms','sqft_living','sqft_lot','floors', 'waterfront','view','grade','sqft_above','sqft_basement', 'lat','sqft_living15'] — robintux, Dec 10 '21 at 02:06

score 171 · Accepted Answer · edited Jun 12 '22 at 19:19

This might happen inside scikit-learn, and it depends on what you're doing. I recommend reading the documentation for the functions you're using. You might be using one that requires, for example, a positive definite matrix, and your matrix might not meet that criterion.

EDIT: How could I miss that:

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

is obviously wrong. Right would be:

np.any(np.isnan(mat))

and

np.all(np.isfinite(mat))

You want to check whether any of the elements are NaN, and not whether the return value of the any function is a number...
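To see why the original check passes even on bad data, here is a minimal sketch. The name mat follows the question; the values are made up for illustration.

```python
import numpy as np

# Hypothetical matrix containing one NaN and one infinity.
mat = np.array([[1.0, np.nan], [np.inf, 4.0]])

# Wrong: mat.any() / mat.all() collapse the array to a single bool first,
# so isnan / isfinite only inspect that bool.
wrong_nan = np.isnan(mat.any())
wrong_finite = np.isfinite(mat.all())

# Right: test every element, then reduce.
has_nan = np.any(np.isnan(mat))
all_finite = np.all(np.isfinite(mat))

print(wrong_nan, wrong_finite)  # False True
print(has_nan, all_finite)      # True False
```

The first pair of results matches what the question reported, even though the array is clearly not clean.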

edited Jun 12 '22 at 19:19

Community

1
1

answered Jul 09 '15 at 16:43

Marcus Müller

7

The docs don't mention anything about this error. I need a way of getting rid of the infinite values from my numpy array. – Ethan Waldie Jul 09 '15 at 17:19
6

As I said: they may not be in your input array. They might occur in the math that happens between input and magical output. The point is that all this math depends on certain conditions for the input. You have to carefully read the docs to find out whether your input satisfies these conditions. – Marcus Müller Jul 10 '15 at 07:54
4

@MarcusMüller could you point me to the location of this document where they specify the requirements of the input matrix? I can't seem to find the "docs" you are referring to. Thank you :) – user2253546 Feb 23 '17 at 21:35

Jun Wang · Answer 2 · 2018-08-12T20:34:06.583

74

I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:

df = df.reset_index()

I encountered this issue many times when I removed some entries in my df, such as

df = df[df.label=='desired_one']

edited Aug 12 '18 at 20:34

answered Dec 24 '17 at 03:43

Jun Wang

9

I love you! That's a rare instance of me finding the right solution despite not knowing what's the cause of the error! – Alexandr Kapshuk Aug 09 '18 at 14:25
9

Calling df.reset_index() adds the old index as an "index" column in the resulting df, which may not be useful in every scenario. If df.reset_index(drop=True) is run instead, it throws the same error. – smm Sep 18 '18 at 18:19

Boern · Answer 3 · 2023-02-06T10:05:07.797

67

This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):

import pandas as pd
import numpy as np

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)
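Note that clean_dataset above modifies the DataFrame it is given (dropna with inplace=True) and silently drops rows. A non-mutating variant that makes the row loss visible could look like the sketch below; the name clean_dataset_copy and the sample values are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd

def clean_dataset_copy(df):
    """Hypothetical non-mutating variant of clean_dataset."""
    # Treat +/-inf as missing, then drop every row with a missing value.
    cleaned = df.replace([np.inf, -np.inf], np.nan).dropna()
    return cleaned.astype(np.float64)

# Made-up example data: one row with NaN, one row with inf.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [1.0, 2.0, np.inf, 4.0]})
cleaned = clean_dataset_copy(df)
print(len(df) - len(cleaned), "rows dropped")  # 2 rows dropped
```

Any row containing even a single NaN or infinity is removed, which is why the cleaned dataset can be smaller than expected.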

edited Feb 06 '23 at 10:05

answered Oct 05 '17 at 08:30

Boern

1

Why do you drop the nan two times? First time with `dropna` then a second time when dropping inf. – luca Jun 25 '18 at 09:04
1

I lose some data when I use this function to clean my dataset. Any suggestions why? – Buddhika Chathuranga Sep 17 '19 at 17:10
9

This is the *only* answer that worked. I tried 20 other answers on SO that did not work. I think this one needs more upvotes. – Contango Jul 05 '20 at 13:23
1

This answer works for me as well. – Sahan Dissanayaka Aug 20 '22 at 06:29
FYI: This approach throws the following warning: `FutureWarning: In a future version of pandas all arguments of DataFrame.any and Series.any will be keyword-only. indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(1)` – A.Casanova Feb 06 '23 at 09:17
Just replace `.any(1)` by .any(axis=1) to avoid this error, it is mandatory from pandas 1.5 to avoid the previous warning – A.Casanova Feb 06 '23 at 09:49
Changed it, thanks-a-lot for pointing it out AND solving it :) – Boern Feb 06 '23 at 10:05

score 24 · Answer 4 · answered Nov 25 '19 at 01:05

In most cases, getting rid of infinite and null values solves this problem.

First, replace the infinite values with NaN:

df.replace([np.inf, -np.inf], np.nan, inplace=True)

Then fill the null values however you like: with a specific value such as 999, with the mean, or with your own function to impute missing values:

df.fillna(999, inplace=True)
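As one possible alternative to a fixed sentinel such as 999, the sketch below fills each column with its own median. The DataFrame contents are made up, and the median is only one of several reasonable imputation choices.

```python
import numpy as np
import pandas as pd

# Made-up data with one infinity and one missing value.
df = pd.DataFrame({"x": [1.0, 2.0, np.inf, 4.0],
                   "y": [10.0, np.nan, 30.0, 50.0]})

# Step 1: turn infinities into NaN so they are treated as missing.
df = df.replace([np.inf, -np.inf], np.nan)

# Step 2: fill each column's missing values with that column's median.
df = df.fillna(df.median())

print(df)
```

A sentinel far outside the data range can act like an outlier for many estimators, which is why a per-column statistic is often preferred.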

score 16 · Answer 5 · answered Apr 13 '16 at 15:12

This is the check on which it fails:

https://github.com/scikit-learn/scikit-learn/blob/0.17.X/sklearn/utils/validation.py#L51

Which says

def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)

So make sure that you have no NaN values in your input, and that all those values are actually float values. None of the values should be Inf either.

score 13 · Answer 6 · answered Jul 14 '15 at 21:09

The Dimensions of my input array were skewed, as my input csv had empty spaces.

answered Jul 14 '15 at 21:09

Ethan Waldie

1

For pandas, I just used `dropna` https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html – FindOutIslamNow Sep 11 '18 at 07:23

score 7 · Answer 7 · answered Aug 10 '17 at 21:13

With this version of python 3:

/opt/anaconda3/bin/python --version
Python 3.6.0 :: Anaconda 4.3.0 (64-bit)

Looking at the details of the error, I found the lines of code causing the failure:

/opt/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     56             and not np.isfinite(X).all()):
     57         raise ValueError("Input contains NaN, infinity"
---> 58                          " or a value too large for %r." % X.dtype)
     59 
     60 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

From this, I was able to extract the correct way to test what was going on with my data using the same test which fails given by the error message: np.isfinite(X)

Then with a quick and dirty loop, I was able to find that my data indeed contains nans:

print(p[:,0].shape)
index = 0
for i in p[:,0]:
    if not np.isfinite(i):
        print(index, i)
    index +=1

(367340,)
4454 nan
6940 nan
10868 nan
12753 nan
14855 nan
15678 nan
24954 nan
30251 nan
31108 nan
51455 nan
59055 nan
...

Now all I have to do is remove the values at these indexes.
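The same search, and the removal step, can be done without a loop. This is a sketch on a small made-up array standing in for p; it assumes that dropping whole rows is acceptable.

```python
import numpy as np

# Made-up 2-D array standing in for `p`.
p = np.array([[1.0, 5.0], [np.nan, 6.0], [3.0, 7.0], [np.inf, 8.0]])

# Row positions where the first column is NaN or infinite.
bad_rows = np.flatnonzero(~np.isfinite(p[:, 0]))
print(bad_rows)  # [1 3]

# Keep only the rows in which every value is finite.
p_clean = p[np.isfinite(p).all(axis=1)]
print(p_clean.shape)  # (2, 2)
```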

score 7 · Answer 8 · answered Jan 09 '21 at 18:48

None of the answers here worked for me. This was what worked.

Test_y = np.nan_to_num(Test_y)

It replaces the infinity values with very large finite values and the nan values with zero.
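A small sketch of what np.nan_to_num does, on a made-up array. The explicit nan, posinf and neginf keyword arguments are available in NumPy 1.17 and later.

```python
import numpy as np

# Made-up array; Test_y in the answer would be the real data.
values = np.array([1.0, np.nan, np.inf, -np.inf])

# Defaults: NaN -> 0.0, +inf -> largest float64, -inf -> most negative float64.
default = np.nan_to_num(values)

# Explicit replacement values instead of the defaults.
custom = np.nan_to_num(values, nan=0.0, posinf=1e6, neginf=-1e6)
print(custom)
```

The default replacements for infinity are extremely large, so explicit values may be a better fit when the data is later scaled or summed.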

answered Jan 09 '21 at 18:48

Emmac

1

using your suggestion on my x_train and x_test solved the problem for me. – wandesky Nov 01 '21 at 03:42

score 7 · Answer 9 · answered Oct 21 '21 at 18:51

The problem seems to occur in the DecisionTreeClassifier input check. Try:

X_train = X_train.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)

answered Oct 21 '21 at 18:51

Mayukh Pankaj

score 5 · Answer 10 · edited Jul 14 '20 at 04:23

I had the same error, and in my case X and y were dataframes so I had to convert them to matrices first:

# np.float64 rather than the np.float alias, which was removed in NumPy 1.24
X = X.values.astype(np.float64)
y = y.values.astype(np.float64)

Edit: The originally suggested X.as_matrix() is deprecated.

edited Jul 14 '20 at 04:23

Ali Pardhan

answered Jul 02 '17 at 10:40

tekumara

Elias Strehle · Answer 11 · 2018-05-07T12:54:08.837

4

I had the error after trying to select a subset of rows:

df = df.reindex(index=my_index)

Turns out that my_index contained values that were not contained in df.index, so the reindex function inserted some new rows and filled them with nan.
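A minimal reproduction of that behaviour, with made-up labels. Using Index.intersection is one possible way to keep only labels that exist, not necessarily what the answer's author did.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [10.0, 20.0, 30.0]}, index=["a", "b", "c"])
my_index = ["a", "c", "z"]  # "z" is not in df.index

# reindex inserts a row for "z" and fills it with NaN.
reindexed = df.reindex(index=my_index)
print(reindexed["value"].isna().sum())  # 1

# Keeping only labels that exist in df.index avoids the NaN row.
safe = df.loc[df.index.intersection(my_index)]
print(len(safe))  # 2
```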

edited May 07 '18 at 12:54

answered Feb 15 '18 at 16:07

Elias Strehle

Renel Chesak · Answer 12 · 2020-12-10T10:52:32.200

Remove all infinite values:

(and replace with min or max for that column)

import numpy as np

# generate example matrix
matrix = np.random.rand(5,5)
matrix[0,:] = np.inf
matrix[2,:] = -np.inf
>>> matrix
array([[       inf,        inf,        inf,        inf,        inf],
       [0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
       [      -inf,       -inf,       -inf,       -inf,       -inf],
       [0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
       [0.90272002, 0.37357483, 0.92952479, 0.072105  , 0.20837798]])

# find min and max values for each column, ignoring nan, -inf, and inf
mins = [np.nanmin(matrix[:, i][matrix[:, i] != -np.inf]) for i in range(matrix.shape[1])]
maxs = [np.nanmax(matrix[:, i][matrix[:, i] != np.inf]) for i in range(matrix.shape[1])]

# go through matrix one column at a time and replace  + and -infinity 
# with the max or min for that column
for i in range(matrix.shape[1]):
    matrix[:, i][matrix[:, i] == -np.inf] = mins[i]
    matrix[:, i][matrix[:, i] == np.inf] = maxs[i]

>>> matrix
array([[0.90272002, 0.37357483, 0.95222639, 0.37570528, 0.68779902],
       [0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
       [0.72877665, 0.06580068, 0.7427659 , 0.00833664, 0.20837798],
       [0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
       [0.90272002, 0.37357483, 0.92952479, 0.072105  , 0.20837798]])

score 2 · Answer 13 · answered Feb 25 '22 at 15:24

I found that after calling pct_change on a new column, nan existed in one of the rows. I removed the nan row with the following code:

df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()
df = df.reset_index()
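For illustration, pct_change can produce both NaN (the first row has no predecessor) and inf (division by a zero predecessor). The sketch below uses a made-up price column and passes drop=True so that the old index is not kept as an extra column.

```python
import numpy as np
import pandas as pd

# Made-up price series; the leading 0 forces a division by zero.
df = pd.DataFrame({"price": [0.0, 1.0, 2.0]})
df["change"] = df["price"].pct_change()
# change is now [NaN, inf, 1.0]

# Turn infinities into NaN, then drop every incomplete row.
df = df.replace([np.inf, -np.inf], np.nan).dropna()
# drop=True discards the old index instead of adding it as a column.
df = df.reset_index(drop=True)
print(df)
```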

answered Feb 25 '22 at 15:24

Golden Lion

score 1 · Answer 14 · answered Jun 08 '18 at 12:21

I got the same error. It worked once I called df.fillna(-99999, inplace=True) before doing any replacement, substitution, etc.

answered Jun 08 '18 at 12:21

Cohen

5

This is a dirty fix. There is a reason why your array contains `nan` values; you should find it. – Elias Strehle Jun 25 '18 at 15:31
the data could contain nan and this gives a way to replace it with data with values that he/she finds acceptable – user2867432 Sep 09 '18 at 21:37
Is this adding outliers to the missing data? – ah bon Mar 24 '23 at 01:11

score 1 · Answer 15 · answered Jan 01 '21 at 17:05

I would like to propose a solution for numpy that worked well for me. The lines

import numpy as np

inputArray[inputArray == np.inf] = np.finfo(np.float64).max

substitute all positive infinite values of a numpy array with the maximum float64 number. Negative infinity and NaN are left untouched.

answered Jan 01 '21 at 17:05

Hagbard

score 1 · Answer 16 · answered Apr 28 '22 at 18:04

Puff !! In my case the problem was about NaN values...

You can list your columns that had NaN with this function

your_data.isnull().sum()

and then you can fill these NaN values in your dataset.

Here is the code for how to "Replace NaN with zero and infinity with large finite numbers."

your_data[:] = np.nan_to_num(your_data)

from numpy.nan_to_num

score 1 · Answer 17 · answered May 30 '22 at 15:06

If you happen to use the "kc_house_data.csv" dataset (which some commenters and many data-science newcomers seem to use, because it's presented in lots of popular course material), the data is faulty and the true source for the error.

To fix it, as of 2022:

Delete the last (empty) line in the csv file
There are two lines that contain one empty data value "x,x,,x,x" - to fix it, don't delete the comma, instead add a random integer value like 2000, so it looks like this "x,x,2000,x,x"

Don't forget to save and reload in your project.

All the other answers are helpful and correct, but not in this case:

If you use kc_house_data.csv you need to fix the data in the file, nothing else will help, the empty data field will shift the other data around randomly and generate weird bugs that are hard to track to the source!

score 0 · Answer 18 · answered Jun 25 '18 at 09:24

In my case the problem was that many scikit functions return numpy arrays, which are devoid of pandas index. So there was an index mismatch when I used those numpy arrays to build new DataFrames and then I tried to mix them with the original data.
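A sketch of how such an index mismatch creates NaN values. The data and column names are made up, and the multiplication merely stands in for any scikit-learn call that returns a plain numpy array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "label": ["a", "b", "a", "a"]})
subset = df[df.label == "a"]  # index is now [0, 2, 3]

# Stand-in for an estimator's output: a plain array with no index.
scaled = subset[["x"]].to_numpy() * 10.0

# A new DataFrame gets a fresh 0..n-1 index, so alignment leaves a gap.
misaligned = subset.assign(x_scaled=pd.DataFrame(scaled)[0])
print(misaligned["x_scaled"].isna().sum())  # 1

# Reusing the original index keeps the rows aligned.
aligned = subset.assign(x_scaled=pd.DataFrame(scaled, index=subset.index)[0])
print(aligned["x_scaled"].isna().sum())  # 0
```

This is also the likely reason why resetting the index after filtering rows makes the error disappear for some users.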

answered Jun 25 '18 at 09:24

luca

score 0 · Answer 19 · answered Jan 21 '21 at 09:44

dataset = dataset.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

This worked for me

answered Jan 21 '21 at 09:44

Parthiban

Chris Cooper · Answer 20 · 2022-03-20T21:44:34.687

0

If you're running an estimator, it could be that your learning rate is too high. I passed in the wrong array to a grid search by accident and ended up training with a learning rate of 500, which I could see causing issues with the training process.

Basically it's not necessarily only your inputs that have to all be valid, but the intermediate data as well.
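As a small illustration of intermediate values going bad, the sketch below starts from finite inputs and ends with an infinite result. The exponential is only a stand-in for whatever computation an estimator performs internally.

```python
import numpy as np

x = np.array([1.0, 10.0, 1000.0])  # every input is finite

# exp(1000) overflows float64, so the result contains inf.
with np.errstate(over="ignore"):  # silence the overflow warning
    y = np.exp(x)

print(np.isfinite(x).all())  # True
print(np.isfinite(y).all())  # False
```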

edited Mar 20 '22 at 21:44

answered Aug 19 '21 at 19:24

Chris Cooper

score 0 · Answer 21 · answered Dec 13 '21 at 15:29

I had the same issue, in my case the answer was simply that I had a cell in my CSV with no value ("x,y,z,,"). Putting a default value in fixed it for me.
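To locate such empty cells after loading, something along these lines may help. The CSV content is made up and read from a string rather than a file.

```python
import io
import pandas as pd

# Made-up CSV with one empty field in the second data row.
csv_text = "a,b,c\n1,2,3\n4,,6\n7,8,9\n"
df = pd.read_csv(io.StringIO(csv_text))

# pandas reads the empty field as NaN; count missing values per column.
print(df.isna().sum())

# Show the rows that contain at least one missing value.
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```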

answered Dec 13 '21 at 15:29

Goel Nimi

score 0 · Answer 22 · answered Jan 02 '22 at 17:05

Using isneginf may help. http://docs.scipy.org/doc/numpy/reference/generated/numpy.isneginf.html#numpy.isneginf

x[numpy.isneginf(x)] = 0 #0 is the value you want to replace with

answered Jan 02 '22 at 17:05

Joyanta J. Mondal

Tomasz Bartkowiak · Answer 23 · 2022-01-27T19:26:40.387

Note: This solution only applies if you consciously want to keep NaN entries in your dataset.

This error happened to me when I was using some of the scikit-learn functionality (in my case: GridSearchCV). Under the hood I was using an xgboost XGBClassifier, which handles NaN data gracefully. However, GridSearchCV was using the sklearn.utils.validation module, which enforced the absence of missing data in the input by calling the _assert_all_finite function. This was ultimately causing an error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Sidenote: _assert_all_finite accepts an allow_nan argument, which, if set to True, would not be causing issues. However, scikit-learn API does not allow us to have control over this argument.

Solution

My solution was to use mock.patch (from the standard library's unittest.mock module) to silence the _assert_all_finite function so that it does not raise ValueError. Here is a snippet:

from unittest import mock

import sklearn

with mock.patch("sklearn.utils.validation._assert_all_finite"):
    # your code that raises ValueError goes here
    pass

this will replace the _assert_all_finite by a dummy mock function so it won't get executed.

Please note that patching is not a recommended practice and might result in unpredictable behaviour!

EDIT: This Pull Request should resolve the issue (though the fix has not been released as of Jan 2022)

score 0 · Answer 24 · answered May 18 '22 at 16:25

After a long time of dealing with this problem, I realized that this is because in splits of training and testing sets there are columns of data which are the same for all data rows. Then some calculations in some algorithms may lead to infinity results. If the data that you are using is in a way that close rows are more likely to be similar then shuffling the data can help. This is a bug with scikit. I'm using version 0.23.2.

TiTo · Answer 25 · 2022-10-12T08:16:51.807

In my case the algorithm required data to be in the open interval (0,1). My quite brutal solution was to add a small random number to all desired values:

y_train = pd.DataFrame(y_train).applymap(lambda x: x + np.random.rand()/100000.0)["col_name"]
y_train[y_train >= 1] = 0.999999

while y_train is in the range of [0,1].

This is definitely not suitable for all cases, as you are messing with your input data but can be a solution if you have sparse data and only need a quick forecast

Yi Tang · Answer 26 · 2023-04-03T21:14:46.033

sklearn=1.1.2

python=3.9

In my case the PowerTransformer with standardize=True was causing the problem. As @TomaszBartkowiak already explained, the assertion is raised in sklearn.utils.validation._assert_all_finite, which seems to be used in many places before aggregations like np.sum in my case.

I checked all the conditions manually (dtypes, nan, inf, -inf) and found no reason why the assertion should still fail. So I simply commented out the check temporarily in _assert_all_finite, line 108:

...
is_float = X.dtype.kind in "fc"
if True:#is_float and (np.isfinite(_safe_accumulator_op(np.sum, X))):
    pass
elif is_float:
...

After the successful execution of PowerTransformer I changed the code back. This is quick and dirty, but if you are really confident in your data and the error seems to happen for no reason, you can solve it this way. Generally speaking, though, the probability is very high that the data does contain INF/-INF somewhere, so it is better to dig deeper. In the case of a scaler you can easily find the columns with INF/-INF in the output, so that you can go back and check these columns again in the input data. I don't know why the checks didn't work in the first place before using the scaler ...

score -1 · Answer 27 · answered Mar 14 '19 at 08:22

try

mat.sum()

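If the sum of your data is not finite, scikit-learn can no longer take its fast path and falls back to checking every element, raising this error if any element is NaN or infinite. Note that the sum overflows once it exceeds the maximum for the dtype: about 1.8e308 for float64 (3.402823e+38 is the float32 maximum).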

see the _assert_all_finite function in validation.py from the scikit source code:

if is_float and np.isfinite(X.sum()):
    pass
elif is_float:
    msg_err = "Input contains {} or a value too large for {!r}."
    if (allow_nan and np.isinf(X).any() or
            not allow_nan and not np.isfinite(X).all()):
        type_err = 'infinity' if allow_nan else 'NaN, infinity'
        # print(X.sum())
        raise ValueError(msg_err.format(type_err, X.dtype))
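A sketch of the sum-overflow case, consistent with the code quoted above: a non-finite sum only sends the check down the slower element-wise path, which raises only if some element is itself NaN or infinite. The array here is made up and uses float32 so that the overflow is easy to trigger.

```python
import numpy as np

# Every element is finite, but their float32 sum overflows.
X = np.full(4, 3e38, dtype=np.float32)

with np.errstate(over="ignore"):  # silence the overflow warning
    total = X.sum()

print(np.isfinite(total))    # False
print(np.isfinite(X).all())  # True
```

So an infinite sum is a hint worth investigating, but the element-wise test is what decides whether the data actually contains NaN or infinity.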
