
After applying KernelPCA to my data and passing it to a classifier (SVC) I'm getting the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

and this warning while performing KernelPCA:

RuntimeWarning: invalid value encountered in sqrt X_transformed = self.alphas_ * np.sqrt(self.lambdas_)

Looking at the transformed data I've found several nan values.

It makes no difference which kernel I'm using. I tried cosine, rbf and linear.

But what's interesting:

  • My original data contains only values between 0 and 1 (no inf or nan); it is scaled with MinMaxScaler

  • Applying standard PCA works, which I thought was equivalent to KernelPCA with a linear kernel.

Some more facts:

  • My data is high-dimensional (> 8000 features) and mostly sparse.
  • I'm using the newest version of scikit-learn, 0.18.2

Any idea how to overcome this and what could be the reason?
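Roughly, my setup looks like the sketch below (simplified, not my actual code; the random X and y are only placeholders standing in for my real sparse, >8000-feature dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import KernelPCA
from sklearn.svm import SVC

# Placeholder data standing in for the real dataset.
rng = np.random.RandomState(0)
X = rng.rand(200, 100)
y = rng.randint(0, 2, size=200)

X_scaled = MinMaxScaler().fit_transform(X)                 # values in [0, 1]
X_kpca = KernelPCA(kernel="rbf").fit_transform(X_scaled)   # emits the sqrt warning on my data

clf = SVC()
clf.fit(X_kpca, y)   # raises "Input contains NaN, infinity or ..." on my data
```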

ScientiaEtVeritas
  • Are you getting any warnings during kernelPCA fit or transform? Maybe [this is related](https://github.com/scikit-learn/scikit-learn/pull/8531). – Vivek Kumar Jun 23 '17 at 08:46
  • @VivekKumar: You're right, there is a warning: ``RuntimeWarning: invalid value encountered in sqrt X_transformed = self.alphas_ * np.sqrt(self.lambdas_)`` – ScientiaEtVeritas Jun 23 '17 at 09:02
  • You should try to pinpoint a smaller subset of your data where this warning occurs and post it here along with your code. Also try updating your scikit-learn version to match the master branch I linked in the previous comment, to see whether the error is still present. – Vivek Kumar Jun 23 '17 at 09:05

1 Answer


The NaNs are produced because some of the eigenvalues (self.lambdas_) of the centered kernel matrix are negative; the square root of a negative value is NaN, and those NaNs in the transformed data are what later trigger the ValueError in the classifier.
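You can check this on the fitted model. A small diagnostic sketch (lambdas_ is the attribute name in scikit-learn 0.18.x; the random X below is only a stand-in for your scaled data):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Placeholder standing in for your MinMaxScaler-scaled matrix.
rng = np.random.RandomState(0)
X = rng.rand(50, 200)

kpca = KernelPCA(n_components=50, kernel="linear")
X_kpca = kpca.fit_transform(X)

# lambdas_ holds the eigenvalues; any negative entry becomes NaN under
# np.sqrt and propagates into the transformed data.
print("negative eigenvalues:", np.sum(kpca.lambdas_ < 0))
print("NaNs in transformed data:", np.isnan(X_kpca).sum())
```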

The issue can be worked around by setting KernelPCA(remove_zero_eig=True, ...), but this does not preserve the dimensionality of the transformed data, since components with non-positive eigenvalues are dropped. Use this parameter as a last resort, as the model's results may be skewed.
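Reusing X from the snippet above, the workaround looks roughly like this:

```python
from sklearn.decomposition import KernelPCA

# remove_zero_eig=True discards every component whose eigenvalue is not
# strictly positive, so the output has fewer columns but no NaNs from sqrt.
kpca = KernelPCA(kernel="rbf", remove_zero_eig=True)
X_kpca = kpca.fit_transform(X)
```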

It has also been stated that negative eigenvalues indicate model misspecification, which is obviously bad. A possible way to avoid this without sacrificing dimensionality through remove_zero_eig is to reduce the number of original features that are highly correlated with one another. Build the correlation matrix and see what those values are; then drop the redundant features and fit KernelPCA() again.
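Something along these lines, as a rough sketch (assuming the data fits in a pandas DataFrame; df and the 0.95 threshold are placeholders, not taken from the question):

```python
import numpy as np
import pandas as pd

# df stands in for your feature matrix; a random placeholder here.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(100, 20))

corr = df.corr().abs()
# Keep only the upper triangle so each feature pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# Drop one column out of every pair with absolute correlation above the threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(to_drop, axis=1)
```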

E.Z
  • Thank you for your answer :) You are right, ``remove_zero_eig=True`` makes the prediction score worse, but building a correlation matrix seems impractical for more than 8,000 features. My intention in using ``KernelPCA`` is exactly what you describe: to reduce the features and combine features that are greatly correlated. Are there any alternatives, maybe with ``sklearn`` or ``pandas``, that can automate this step? My dataset tends to have features with the same value on all rows... – ScientiaEtVeritas Jun 26 '17 at 09:16
  • Wow, that is uncomfortable. You could check [here](https://stackoverflow.com/questions/29294983/how-to-calculate-correlation-between-all-columns-and-remove-highly-correlated-on) for how to deal with correlated columns of a *Pandas DataFrame*, or use `DataFrame.drop_duplicates`. Start with the features that are exactly identical (if there are any), i.e. with the threshold equal to 1. – E.Z Jun 26 '17 at 10:13
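A rough sketch of that cleanup step for a pandas DataFrame df (the random df is only a placeholder; drop_duplicates works row-wise, so the frame is transposed to compare columns):

```python
import numpy as np
import pandas as pd

# Placeholder frame standing in for your dataset.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(0, 2, size=(100, 10)))

df = df.loc[:, df.nunique() > 1]   # drop constant features (same value on every row)
df = df.T.drop_duplicates().T      # drop exact duplicate feature columns
```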