
I have the following code:

X = df_X.as_matrix(header[1:col_num])
scaler = preprocessing.StandardScaler().fit(X)
X_nor = scaler.transform(X) 

And got the following error:

  File "/Users/edamame/Library/python_virenv/lib/python2.7/site-packages/sklearn/utils/validation.py", line 54, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I used:

print(np.isinf(X))
print(np.isnan(X))

which gives me the output below. This doesn't really tell me which element has the issue, as I have millions of rows.

[[False False False ..., False False False]
 [False False False ..., False False False]
 [False False False ..., False False False]
 ..., 
 [False False False ..., False False False]
 [False False False ..., False False False]
 [False False False ..., False False False]]

Is there a way to identify which value in the matrix X actually causes the problem? How do people avoid this in general?


2 Answers


numpy contains various logical element-wise tests for this sort of thing.

In your particular case, you will want to use isinf and isnan.

In response to your edit:

You can pass the result of np.isinf() or np.isnan() to np.where(), which will return the indices where a condition is true. Here's a quick example:

import numpy as np

test = np.array([0.1, 0.3, float("Inf"), 0.2])

bad_indices = np.where(np.isinf(test))

print(bad_indices)

You can then use those indices to replace the content of the array:

test[bad_indices] = -1
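
Since the error message mentions NaN, infinity, or an overly large value, it can be handy to test for all of them at once with np.isfinite. A minimal sketch, applied to the matrix X from the question:

import numpy as np

# np.isfinite is False for NaN, +inf and -inf, so ~np.isfinite flags every offender at once
bad_rows, bad_cols = np.where(~np.isfinite(X))

print(bad_rows)  # row indices of the problematic values
print(bad_cols)  # column indices of the problematic values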

  • Thanks. Please see my modified question above; I need to find the specific bad values out of millions of records ... is there a good approach? – Edamame Apr 10 '16 at 17:01
  • I actually got: ('bad_indices ', (array([], dtype=int64), array([], dtype=int64))) from print('bad_indices ', np.where(np.isinf(X))) ... didn't return any index ... – Edamame Apr 11 '16 at 00:03
  • It returned an empty array of indices, which means there are no infinite values in the array. Try isnan() instead - the error message indicates it could be one or the other. – Thomite Apr 11 '16 at 11:32
  • found it with np.isnan(). Thanks! – Edamame Apr 11 '16 at 16:30

"How do people avoid it in general?"

Real example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data360 = pd.read_csv(r'C:...')

s = StandardScaler()
data360 = s.fit_transform(data360)

print(np.where(np.isnan(data360)))

Output:

(array([ 130, 161, 889, ..., 1884216, 1884276, 1884550], dtype=int64), array([1, 1, 1, ..., 1, 1, 1], dtype=int64))

You can, out of curiosity, verify this by looking up one of the rows in question (I checked row 132 in my csv file, which corresponds to index 130 in the array): 1010, 131, 0.115462015, nan, 0.291065837, 0.083311105, 8, 2, 2
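
For example, to pull up one of the flagged rows directly from the scaled array (index 130 from the np.where output above), something like this works; the nan is still visible in that position:

print(data360[130])  # the scaled row; the nan from the original csv is still there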

One way to "fix" the issue: df_new = data360[np.isfinite(data360).all(1)]

This returns the same data frame, without the rows containing NaN.
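
If you also need the same rows removed from data you have not scaled (so everything stays aligned), the boolean mask can be reused. A minimal sketch, where raw360 is a hypothetical copy of the DataFrame kept aside before fit_transform:

# rows where every scaled value is finite (no NaN, no +/-inf)
finite_mask = np.isfinite(data360).all(axis=1)

raw_clean = raw360[finite_mask]  # hypothetical: filter the un-scaled DataFrame with the same mask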

Checking len() before and after processing shows that the data set has been reduced (in my case) from 1884600 to 1870298 rows.

Edit: you should evaluate the data you have and what you will use it for before simply removing all rows containing NaN.
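
If dropping rows would throw away too much data, filling the gaps is a common alternative. A minimal sketch using pandas' fillna, assuming the column mean is a sensible substitute for your particular data:

import pandas as pd

data360 = pd.read_csv(r'C:...')
# replace NaN with each column's mean (DataFrame.mean() skips NaN by default)
data360 = data360.fillna(data360.mean())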
