Scikit-learn: error in fitting model - Input contains NaN, infinity or a value too large for float64

Question

My question appears to be the same as previous posts (post-1, post-2, and post-3). I followed their solutions but still got the same error, so I am posting here to ask for advice.

I am using basic functions from sklearn. The raw data contain missing values, so I use Imputer to fill them with the column medians. Then I use LabelEncoder to encode the nominal features as numerical ones. After that, I use StandardScaler to normalize the data set.

The problem is at the LinearRegression stage. I get "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')." But I did check the dataset, and it contains no NaN, infinity, or overly large values.

I really have no idea where this error comes from. Please feel free to comment if you have any clues. Thank you!

The code I am using is:

import csv
import numpy as np
from sklearn import preprocessing 
from sklearn import linear_model
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler

out_file = 'raw.dat'      
dataset = np.loadtxt(out_file, delimiter=',')
data = dataset[:, 0:-1]   # select all columns except the last
target = dataset[:, -1]   # select the last column

# to handle missing values
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
imp.fit(data)
data_imp = imp.transform(data)

# label Encoder: converting nominal features
le = preprocessing.LabelEncoder()
le.fit(data_imp[:, 2])
print le.classes_
le.transform(data_imp[:, 2])

le.fit(data_imp[:, 3])
print le.classes_
le.transform(data_imp[:, 3])

print '# of data: ', len(target)

scaler = preprocessing.StandardScaler().fit_transform(data_imp)
scaler = scaler.astype(np.float64, copy=False)

np.savetxt("newdata2.csv", scaler, delimiter=",")
ols = linear_model.LinearRegression()
for x in xrange(2, len(scaler)):
    print x
    scaler = scaler[:x, 1:]
    print scaler
    print np.isnan(scaler.any()) # False
    print np.any(np.isnan(scaler)) # False

    print np.isfinite(scaler.all()) # True
    print np.all(np.isfinite(scaler)) # True

    ols.fit(scaler, target)
    print ols

The error message is shown below.

Traceback (most recent call last):
  File ".\data_export.py", line 123, in prep
    ols.fit(scaler, target)
  File "C:\Python27\lib\site-packages\sklearn\linear_model\base.py", line 427, in fit
    y_numeric=True, multi_output=True)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 513, in check_X_y
    dtype=None)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 398, in check_array
    _assert_all_finite(array)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 54, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The raw data (raw.dat) is partly shown below:

1, 2.0, 14002, 1, 1965, 1, 1, 2, NaN, 771, 648.0, 4800.0
2, 2.8, 14002, 2, 1924, 3, 1, 4, NaN, 1400, 714.0, 999.0
3, 2.1, 14002, 1, 1965, 1, 1, 2, NaN, 725, 675.0, 4000.0
4, 1.6, 14002, 2, 1914, 2, 1, 3, 1, 1530, 620.0, 9950.0
5, 8.9, 14010, 1, 1973, 2, 1, 3, NaN, 1048, 705.0, 9000.0
6, 7.3, 14010, 1, 1982, 1, 1, 2, 1, 880, 656.0, 5000.0
......

After filling the missing values and normalizing the numbers, the data in newdata2.csv look like the following:

-1.70   -2.23   -1.64   -1.15   -0.40   -1.80   -0.86   -1.78   0.05    -1.35   0.37
-1.70   -2.14   -1.64   0.28    -2.54   0.36    -0.86   -0.56   0.05    0.21    0.75
-1.70   -2.22   -1.64   -1.15   -0.40   -1.80   -0.86   -1.78   0.05    -1.46   0.52
-1.70   -2.28   -1.64   0.28    -3.06   -0.72   -0.86   -1.17   0.05    0.53    0.20
-1.70   -1.43   -1.62   -1.15   0.01    -0.72   -0.86   -1.17   0.05    -0.66   0.69
....

score 0 · Answer 1 · answered Aug 19 '16 at 23:46

You have NaN values in your raw.dat file. Drop that column or replace it with 0s.
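
As a rough illustration of the two remedies named above, the sketch below uses plain NumPy on three rows copied from the raw.dat excerpt in the question. The variable names (`bad_cols`, `data_dropped`, `data_zeroed`) and the helper `count_non_finite` are made up for this example and do not come from the original post.

```python
import numpy as np

# Three rows copied from the raw.dat excerpt; the last column is the target.
dataset = np.array([
    [1, 2.0, 14002, 1, 1965, 1, 1, 2, np.nan, 771, 648.0, 4800.0],
    [2, 2.8, 14002, 2, 1924, 3, 1, 4, np.nan, 1400, 714.0, 999.0],
    [4, 1.6, 14002, 2, 1914, 2, 1, 3, 1, 1530, 620.0, 9950.0],
])
data, target = dataset[:, :-1], dataset[:, -1]


def count_non_finite(arr):
    """Return how many entries of arr are NaN or infinite (hypothetical helper)."""
    return int(np.sum(~np.isfinite(arr)))


# Indices of feature columns holding at least one NaN or infinite value.
bad_cols = np.where(~np.isfinite(data).all(axis=0))[0]

# Remedy 1: drop the affected column(s).
data_dropped = np.delete(data, bad_cols, axis=1)

# Remedy 2: keep the column(s) but replace NaN with 0.
data_zeroed = np.where(np.isnan(data), 0.0, data)

print("non-finite before: %d" % count_non_finite(data))
print("non-finite after zeroing: %d" % count_non_finite(data_zeroed))
```

Whether dropping or zero-filling is the better choice depends on what the column means; zero-filling treats a missing value as a real 0, which may or may not be sensible for this data.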

Bedi Egilmez


Thanks! Actually, I did fix the missing values. You can see the newdata2.csv file. The NaNs have been replaced and the data normalized... – Neverfaraway Aug 20 '16 at 00:16
No, the problem was still there. It is the same error, after I fixed the missing values. – Neverfaraway Aug 21 '16 at 03:51
Do not normalize your data. Just replace `NaN`s with zeros and run your regression. – Bedi Egilmez Aug 21 '16 at 04:20
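
A minimal sketch of that suggestion, assuming the six rows from the raw.dat excerpt and a current scikit-learn where `LinearRegression` is importable from `sklearn.linear_model`. The traceback in the question passes through `check_X_y`, which validates the target as well as the features, so the sketch checks both arrays before fitting. The helper `all_finite` and the names `data_zeroed`, `x_ok`, and `y_ok` are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The six rows shown in the raw.dat excerpt; the last column is the target.
dataset = np.array([
    [1, 2.0, 14002, 1, 1965, 1, 1, 2, np.nan, 771, 648.0, 4800.0],
    [2, 2.8, 14002, 2, 1924, 3, 1, 4, np.nan, 1400, 714.0, 999.0],
    [3, 2.1, 14002, 1, 1965, 1, 1, 2, np.nan, 725, 675.0, 4000.0],
    [4, 1.6, 14002, 2, 1914, 2, 1, 3, 1, 1530, 620.0, 9950.0],
    [5, 8.9, 14010, 1, 1973, 2, 1, 3, np.nan, 1048, 705.0, 9000.0],
    [6, 7.3, 14010, 1, 1982, 1, 1, 2, 1, 880, 656.0, 5000.0],
])
data, target = dataset[:, :-1], dataset[:, -1]

# Replace NaN with zeros and skip normalization, as the comment suggests.
data_zeroed = np.where(np.isnan(data), 0.0, data)


def all_finite(arr):
    """True if arr has no NaN or infinite entries (hypothetical helper)."""
    return bool(np.all(np.isfinite(arr)))


# Check the features AND the target; the estimator validates both.
x_ok = all_finite(data_zeroed)
y_ok = all_finite(target)
print("features finite: %s, target finite: %s" % (x_ok, y_ok))

if x_ok and y_ok:
    ols = LinearRegression().fit(data_zeroed, target)
    print(ols.coef_.shape)
```

If `y_ok` comes back False on the full file, the non-finite values are in the last column, which the imputation step in the question never touches.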
