My question appears to be the same as previous posts (post-1, post-2, and post-3). I followed their solutions but still got the same error, so I am posting here to ask for advice.
I am using basic functions from sklearn. The raw data contain missing values, so I use Imputer to fill them with the column medians. Then I use LabelEncoder to encode the nominal features (which are stored as numbers). After that, I use StandardScaler to normalize the data set.
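
For reference, the three preprocessing steps described above can be sketched in plain NumPy. This is only an illustration under assumptions, not what sklearn does internally: the helper names `impute_median`, `encode_column`, and `standardize` and the toy array are made up for this sketch, and the encoded column is written back into the array.

```python
import numpy as np

def impute_median(X):
    """Replace each NaN with the median of its column (NaNs ignored)."""
    X = np.array(X, dtype=np.float64)      # work on a copy
    medians = np.nanmedian(X, axis=0)      # per-column median, skipping NaN
    rows, cols = np.where(np.isnan(X))     # positions of the missing entries
    X[rows, cols] = medians[cols]
    return X

def encode_column(col):
    """Map each distinct value to an integer code 0..k-1 (sorted order)."""
    classes, codes = np.unique(col, return_inverse=True)
    return classes, codes

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0                  # leave constant columns unscaled
    return (X - mean) / std

# Hypothetical toy data, loosely modeled on the raw rows in this post.
X_raw = np.array([
    [2.0, 14002.0, np.nan],
    [2.8, 14002.0, 1.0],
    [8.9, 14010.0, np.nan],
    [7.3, 14010.0, 1.0],
])

X_imp = impute_median(X_raw)
classes, codes = encode_column(X_imp[:, 1])
X_imp[:, 1] = codes                        # keep the encoded values
X_std = standardize(X_imp)

print(classes)
print(np.isnan(X_std).any())
```

The point of the sketch is the order of operations and the fact that each step returns a new array that has to be kept.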
The problem is at the LinearRegression stage. I get "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')." But I did check the data set, and it contains no NaN, no infinity, and no overly large values.
I really have no idea where this error comes from. Please feel free to comment if you have any clues. Thank you!
The code I am using is:
import csv
import numpy as np
from sklearn import preprocessing
from sklearn import linear_model
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler

out_file = 'raw.dat'
dataset = np.loadtxt(out_file, delimiter=',')
data = dataset[:, 0:-1]  # all columns except the last (features)
target = dataset[:, -1]  # the last column (target)

# to handle missing values
imp = Imputer(missing_values='NaN', strategy='median', axis=0)
imp.fit(data)
data_imp = imp.transform(data)

# label Encoder: converting nominal features
le = preprocessing.LabelEncoder()
le.fit(data_imp[:, 2])
print le.classes_
le.transform(data_imp[:, 2])
le.fit(data_imp[:, 3])
print le.classes_
le.transform(data_imp[:, 3])

print '# of data: ', len(target)
scaler = preprocessing.StandardScaler().fit_transform(data_imp)
scaler = scaler.astype(np.float64, copy=False)
np.savetxt("newdata2.csv", scaler, delimiter=",")

ols = linear_model.LinearRegression()
for x in xrange(2, len(scaler)):
    print x
    scaler = scaler[:x, 1:]
    print scaler
    print np.isnan(scaler.any())  # False
    print np.any(np.isnan(scaler))  # False
    print np.isfinite(scaler.all())  # True
    print np.all(np.isfinite(scaler))  # True
    ols.fit(scaler, target)
    print ols
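
A side note on the four checks in the loop: `scaler.any()` and `scaler.all()` each collapse the array to a single boolean before `np.isnan` or `np.isfinite` sees it, so the first and third checks cannot detect anything. A small sketch on a made-up array (not the real data) shows the difference:

```python
import numpy as np

# Made-up array that deliberately contains one NaN and one infinity.
a = np.array([[1.0, np.nan], [3.0, np.inf]])

# Reduce first, test afterwards: the test only ever sees True or False.
reduced_first_nan = np.isnan(a.any())        # isnan of a boolean
reduced_first_finite = np.isfinite(a.all())  # isfinite of a boolean

# Test every element first, then reduce: this is the meaningful check.
elementwise_nan = np.isnan(a).any()
elementwise_finite = np.isfinite(a).all()

print(reduced_first_nan, elementwise_nan)
print(reduced_first_finite, elementwise_finite)
```

Only the element-wise forms (the second and fourth checks in the loop) say anything about the contents of the array.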
The error message is shown below.
Traceback (most recent call last):
  File ".\data_export.py", line 123, in prep
    ols.fit(scaler, target)
  File "C:\Python27\lib\site-packages\sklearn\linear_model\base.py", line 427, in fit
    y_numeric=True, multi_output=True)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 513, in check_X_y
    dtype=None)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 398, in check_array
    _assert_all_finite(array)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 54, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
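
Since the traceback goes through `check_X_y`, which validates the target as well as the features, it may be worth running the same finiteness test on `target` and not only on `scaler`. The helper below is a hypothetical sketch (the name `report_non_finite` and the sample arrays are not from the code above); it reports how many NaN and infinite entries an array holds.

```python
import numpy as np

def report_non_finite(name, arr):
    """Print and return the number of NaN and infinite entries in arr."""
    arr = np.asarray(arr, dtype=np.float64)
    n_nan = int(np.isnan(arr).sum())
    n_inf = int(np.isinf(arr).sum())
    print("%s: %d NaN, %d inf" % (name, n_nan, n_inf))
    return n_nan, n_inf

# Made-up stand-ins for the feature matrix and the target vector.
X_demo = np.array([[0.5, -1.2], [1.1, 0.3], [-0.7, 0.9]])
y_demo = np.array([4800.0, np.nan, 4000.0])

x_counts = report_non_finite("X", X_demo)
y_counts = report_non_finite("y", y_demo)
```

With the variables from the code above, the equivalent calls would be `report_non_finite("scaler", scaler)` and `report_non_finite("target", target)` right before `ols.fit`.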
The raw data (raw.dat) is partly shown below:
1, 2.0, 14002, 1, 1965, 1, 1, 2, NaN, 771, 648.0, 4800.0
2, 2.8, 14002, 2, 1924, 3, 1, 4, NaN, 1400, 714.0, 999.0
3, 2.1, 14002, 1, 1965, 1, 1, 2, NaN, 725, 675.0, 4000.0
4, 1.6, 14002, 2, 1914, 2, 1, 3, 1, 1530, 620.0, 9950.0
5, 8.9, 14010, 1, 1973, 2, 1, 3, NaN, 1048, 705.0, 9000.0
6, 7.3, 14010, 1, 1982, 1, 1, 2, 1, 880, 656.0, 5000.0
......
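
To see which columns of the raw file hold missing values, the NaN entries can be counted per column. The sketch below parses only the six sample rows shown above (pasted into a string, since the full raw.dat is not reproduced here), so the counts apply to this excerpt alone.

```python
import numpy as np

# The six rows quoted in this post; the rest of raw.dat is not included.
sample = """\
1, 2.0, 14002, 1, 1965, 1, 1, 2, NaN, 771, 648.0, 4800.0
2, 2.8, 14002, 2, 1924, 3, 1, 4, NaN, 1400, 714.0, 999.0
3, 2.1, 14002, 1, 1965, 1, 1, 2, NaN, 725, 675.0, 4000.0
4, 1.6, 14002, 2, 1914, 2, 1, 3, 1, 1530, 620.0, 9950.0
5, 8.9, 14010, 1, 1973, 2, 1, 3, NaN, 1048, 705.0, 9000.0
6, 7.3, 14010, 1, 1982, 1, 1, 2, 1, 880, 656.0, 5000.0
"""

# float() accepts surrounding whitespace and the literal "NaN".
rows = np.array([[float(v) for v in line.split(",")]
                 for line in sample.strip().splitlines()])

nan_per_column = np.isnan(rows).sum(axis=0)     # NaN count for each column
target_has_nan = np.isnan(rows[:, -1]).any()    # last column is the target

print(nan_per_column)
print(target_has_nan)
```

Running the same two lines on the full `dataset` array would show whether any column outside `data`, in particular the last one, still contains NaN after loading.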
After filling in the missing values and normalizing the numbers, the data in newdata2.csv look like the following:
-1.70 -2.23 -1.64 -1.15 -0.40 -1.80 -0.86 -1.78 0.05 -1.35 0.37
-1.70 -2.14 -1.64 0.28 -2.54 0.36 -0.86 -0.56 0.05 0.21 0.75
-1.70 -2.22 -1.64 -1.15 -0.40 -1.80 -0.86 -1.78 0.05 -1.46 0.52
-1.70 -2.28 -1.64 0.28 -3.06 -0.72 -0.86 -1.17 0.05 0.53 0.20
-1.70 -1.43 -1.62 -1.15 0.01 -0.72 -0.86 -1.17 0.05 -0.66 0.69
....