
The Yeo-Johnson method in sklearn's PowerTransformer (0.21.3; Python 3.6) throws an error

ValueError: Input contains infinity or a value too large for dtype('float64').

even when the data is perfectly valid. Am I overlooking something? Or is this a bug?

Code to reproduce:

import sklearn
from sklearn.preprocessing import PowerTransformer
import numpy as np
import pandas as pd

print(f"sklearn version = {sklearn.__version__}")

data = np.array([1000]*100 + [980]).reshape(-1, 1)
print(f"Data stats:\n{pd.DataFrame(data).describe()}")

## PowerTransformer fit will raise: "Input contains infinity or a value too large for dtype('float64')"
pt = PowerTransformer(method="yeo-johnson")
pt.fit(data)

Output I get:

sklearn version = 0.21.3
Data stats:
                 0
count   101.000000
mean    999.801980
std       1.990074
min     980.000000
25%    1000.000000
50%    1000.000000
75%    1000.000000
max    1000.000000
/home/jupyter/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py:2828: RuntimeWarning:

overflow encountered in power

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-e81214808bec> in <module>()
      8 ## Powertransform. It will give ""
      9 pt = PowerTransformer(method="yeo-johnson")
---> 10 pt.fit(data)

~/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py in fit(self, X, y)
   2672         self : object
   2673         """
-> 2674         self._fit(X, y=y, force_transform=False)
   2675         return self
   2676 

~/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py in _fit(self, X, y, force_transform)
   2703                 X = self._scaler.fit_transform(X)
   2704             else:
-> 2705                 self._scaler.fit(X)
   2706 
   2707         return X

~/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py in fit(self, X, y)
    637         # Reset internal state before fitting
    638         self._reset()
--> 639         return self.partial_fit(X, y)
    640 
    641     def partial_fit(self, X, y=None):

~/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py in partial_fit(self, X, y)
    661         X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
    662                         estimator=self, dtype=FLOAT_DTYPES,
--> 663                         force_all_finite='allow-nan')
    664 
    665         # Even in the case of `with_mean=False`, we update the mean anyway

~/.local/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    540         if force_all_finite:
    541             _assert_all_finite(array,
--> 542                                allow_nan=force_all_finite == 'allow-nan')
    543 
    544     if ensure_min_samples > 0:

~/.local/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
     54                 not allow_nan and not np.isfinite(X).all()):
     55             type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56             raise ValueError(msg_err.format(type_err, X.dtype))
     57     # for object dtype data, we only check for NaNs (GH-13254)
     58     elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains infinity or a value too large for dtype('float64').

I have seen other posts here and here which have inf values in the input. In my case, there is no value greater than 1000.


1 Answer


This is not a bug but a consequence of the internals of PowerTransformer. Have a look at these lines of your stack trace:

~/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py in _fit(self, X, y, force_transform)
   2703                 X = self._scaler.fit_transform(X)
   2704             else:
-> 2705                 self._scaler.fit(X)
   2706 
   2707         return X

The standardize parameter of PowerTransformer is set to True by default. In this case, the provided data is already transformed during the call to fit, and the transformed data is then passed to a StandardScaler (see the source code here).
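
Conceptually, fit with standardize=True behaves like the following two-step sketch (a simplification of the actual source, not the exact code), which reproduces the error with the data from the question:

from sklearn.preprocessing import PowerTransformer, StandardScaler
import numpy as np


data = np.array([1000]*100 + [980]).reshape(-1, 1)

# Step 1: the power transform itself, without standardization
transformed = PowerTransformer(method="yeo-johnson",
                               standardize=False).fit_transform(data)

# Step 2: standardize the transformed data; this is where the error is raised
StandardScaler().fit(transformed)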

The problem now is that your transformed data turns out to be an array of inf values. You can confirm this by obtaining the lambda of the Yeo-Johnson transformation of your data with the corresponding yeojohnson function from scipy and checking the transformation yourself:

from scipy.stats import yeojohnson
import numpy as np


data = np.array([1000]*100 + [980])

# fit the Yeo-Johnson transformation and retrieve the estimated lambda
_, lmbda = yeojohnson(data)
print(lmbda)  # 291.47777013

# apply the Yeo-Johnson formula for x >= 0 and lambda != 0
data_t = (np.power(data + 1, lmbda) - 1) / lmbda

data_t is the result of the Yeo-Johnson transformation, and it contains only inf values. This is what gets passed to the StandardScaler, which complains that its "input" indeed contains inf values. So the error is not about your original data, but about the transformed one.
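
To see why the overflow happens, here is a back-of-the-envelope check (an added illustration, not part of the original answer): float64 cannot represent numbers beyond roughly 1.8e308, while the fitted lambda raises values around 1000 to the power of ~291:

import numpy as np


lmbda = 291.47777013                 # the lambda fitted above
print(np.log10(1001.0) * lmbda)      # ~875 decimal digits, far beyond float64
print(np.power(1001.0, lmbda))       # inf (with an overflow RuntimeWarning)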

You can avoid this error by setting standardize=False, and the fit will run through:

from sklearn.preprocessing import PowerTransformer
import numpy as np


data = np.array([1000]*100 + [980]).reshape(-1, 1)

pt = PowerTransformer(method="yeo-johnson", standardize=False)
data_t = pt.fit_transform(data)

However, along with a RuntimeWarning, you will still get an array full of inf values, which is probably not useful at all. Again, this is not a bug but the actual result of the transformation.
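
You can confirm this by continuing the snippet above (an added check):

print(np.isinf(data_t).all())  # True: every transformed value overflowed to inf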

  • @VinayKolar does this answer your question? – afsharov Jun 15 '21 at 10:12
  • Thanks @afsharov . That explains why. The `lambda` seems too high for this data. Setting `standardize=False` is of no use, like you mentioned. – Vinay Kolar Jun 16 '21 at 00:45
  • I think one solution could be to apply `MinMaxScaler()` before the power transformation. MinMaxScaler does not change the data distribution but makes sure it lies between 0 and 1. This will likely reduce the chances of getting very large (infinity) values from the power operation in PowerTransformer (see the sketch below). – Ather Cheema Apr 27 '22 at 01:37
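
A sketch of the workaround suggested in the last comment (an illustration, not from the answer itself): scaling to [0, 1] first keeps the powers within float64 range for this data, so the fit no longer overflows:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer
import numpy as np


data = np.array([1000]*100 + [980]).reshape(-1, 1)

pipe = Pipeline([
    ("minmax", MinMaxScaler()),                        # 980 -> 0.0, 1000 -> 1.0
    ("power", PowerTransformer(method="yeo-johnson")),
])
data_t = pipe.fit_transform(data)

print(np.isfinite(data_t).all())  # expected: True for this data, no inf values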