3

I have a pandas dataframe with feature values that are really, really small, on the order of 1e-322. I am trying to standardize the features but I am getting

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

A few values from the dataframe are as follows:

3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322

I am assuming that I am dealing with a value underflow problem. How can I deal with this?

This is for Python 3.6 and a pandas dataframe.

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The values in the dataframe should be standardized as needed, but I am getting this error, presumably due to value underflow.

Coddy
  • 549
  • 4
  • 18
  • I wonder what something on the order of `10^{-320}` represents. Very few things in the universe might get to that smallness – rafaelc Aug 07 '19 at 18:03
  • So I am unwrapping an array which looks something like `array([[[-7.45058060e-09, 3.33333329e-01, 1.00000000e+00],` `[-1.00000000e+00, -3.33333329e-01, 6.27143372e-310],` `[3.31023983e-322, 1.35335972e-315, 6.42285340e-322]]],` `dtype=float128) ` with one column for each value. – Coddy Aug 07 '19 at 18:05
  • If you're dealing with stuff in the range `e-9` to `e-1`, then definitely `3e^-322` is zero – rafaelc Aug 07 '19 at 18:09
  • I second rafaelc. Just out of curiosity, what could this number possibly represent? – tnknepp Aug 07 '19 at 18:13
  • @rafaelc If I only deal with `e^-322` would that be of any help? – Coddy Aug 07 '19 at 18:14
  • these values are for atoms – Coddy Aug 07 '19 at 18:15
  • @Coddy Ok, everything I know about atoms (mass, radius, etc.) is a few hundred orders of magnitude greater than E-320. I think rafaelc's point is that in the array you provided the E-322 values are all zero. The numbers you are talking about are roughly 1E300 times smaller than the expected size of strings in string theory. Your numbers are FAR too small to represent anything physical. – tnknepp Aug 07 '19 at 18:21
  • @Coddy If *all* your data is `something times e^-320`, then just drop the `e^-320` factor. For standard scaling, the magnitude doesn't matter, only your distribution – rafaelc Aug 07 '19 at 18:25
  • @rafaelc I do not know the specific application that Coddy is dealing with, but these numbers are too small to represent anything physical. I recommend looking for errors in his processing and doing a unit check. This seems to be a potentially good example of being able to identify when your data is wrong. – tnknepp Aug 07 '19 at 18:27
  • To be exact, they are simulation feature values of crystal diffraction patterns. – Coddy Aug 07 '19 at 18:28
  • @Coddy That's a little out of my wheelhouse (PhD in analytical chemistry), but for something that small you are a few hundred orders of magnitude below the Heisenberg uncertainty limits. I just don't see how we can measure or reliably calculate anything that small. I apologize for not making suggestions regarding your specific question, but I suggest again to check for errors in your processing and in your units. Anything E-300 is indistinguishable from zero. – tnknepp Aug 07 '19 at 18:34
  • @rafaelc Yeah, now that you said that, I am thinking it might be the case. I am a regular CS guy playing around with datasets. I will have a look at whether these values went haywire a few preprocessing steps ago. Thanks a lot :) – Coddy Aug 07 '19 at 18:49

3 Answers

0

Multiply them.

You're right: your values are too small for pandas to handle reliably as floats. The smallest normal np.float64 value is ~2.22e-308; below that, values become subnormal and rapidly lose precision. You can handle somewhat smaller values by using more obscure types like np.longdouble, but these have their limits too and can be system-dependent.

As some of the comments point out, most plausible use cases don't require values this small. But if yours does, one simple way to get around the float boundaries is to multiply all of your values by a consistent integer that brings them within the acceptable float range (perhaps by 10^320). You're not losing any information, just dropping a long sequence of zeroes.

Note: this only works if you're not simultaneously storing numbers too huge to multiply without breaking the float limits in the other direction. But this seems unlikely.
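A minimal sketch of this approach, using toy data in the subnormal range quoted in the question. One wrinkle: the multiplier itself has to be representable as a float64, so a factor like 1e300 works where a literal 1e320 would already evaluate to inf.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data in the subnormal range quoted in the question.
X_train = np.array([[3.962406e-321], [3.310240e-322], [3.962406e-321]])
X_test = np.array([[3.310240e-322]])

# The factor must itself fit in float64: a literal 1e320 evaluates to inf,
# so pick something like 1e300 that brings the values into normal range.
SCALE = 1e300
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train * SCALE)
X_test_std = scaler.transform(X_test * SCALE)

Since standardization subtracts the mean and divides by the standard deviation, a uniform rescaling cancels out: the standardized output is the same as it would have been without the underflow.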

ASGM
  • 11,051
  • 1
  • 32
  • 53
  • Just a caveat, you'll lose precision anyway even when multiplying. If you have `a = 1.2345` and `C = 1e-322`, then `a * C / C` is not equal to `a` – rafaelc Aug 07 '19 at 18:44
  • That's an important point (though on my computer `a * C/C == a` evaluates to `True`). But it depends on how OP is storing them in the first place, right? If the values are strings in some CSV file and they're doing a numeric conversion, then manipulating the strings as strings (by changing the exponent, for example, before numeric conversion) would fix the issue. – ASGM Aug 07 '19 at 18:52
0

Store the log of the number, and reverse with exp when needed later. If you then need to shift the values, the shift becomes additive (instead of multiplicative). Working in log-space helps avoid machine zero, though you'll still have issues to deal with when operating on the log values, e.g. the log of a sum is not the sum of the logs.
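A minimal sketch of this idea, assuming the values are strictly positive (the log is undefined at zero and below):

import numpy as np

values = np.array([3.962406e-321, 3.310240e-322])

# The log of a subnormal float64 is just a moderate negative number (~ -738),
# so everything downstream stays comfortably inside float64 range.
log_values = np.log(values)

# A multiplicative shift becomes additive in log-space:
# multiplying every value by 1e5 is the same as adding log(1e5) to the logs.
shifted = log_values + np.log(1e5)

# Reverse with exp when the raw values are needed again.
recovered = np.exp(shifted)  # == values * 1e5, up to float rounding

For the log-of-sum caveat, something like scipy.special.logsumexp computes the log of a sum of values stored as logs without leaving log-space.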

-1

You should try normalizing your data to bring it within a manageable scale. Here is some sample code:

from sklearn import preprocessing

x = df.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

You are receiving the NaN error because the numbers fall outside the scale that float64 can handle.

EDIT1: Your error says that your dataset contains NaN values that cannot be converted to float64 type. Are you sure there are no empty values? If so, try to drop those rows with the DataFrame.dropna() function, as in the sketch below.
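A quick sketch of that check, assuming df and min_max_scaler are the same objects as in the snippet above:

# Count missing values per column before scaling.
print(df.isna().sum())

# Drop rows that contain NaN, then retry the scaling.
x_scaled = min_max_scaler.fit_transform(df.dropna().values)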

Jaskaran Singh
  • 531
  • 3
  • 14