3

I have a pandas dataframe with feature values that are really, really small, on the order of 1e-322. I am trying to standardize the features but I am getting

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

A few values from the dataframe are as follows:

3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322
3.962406e-321
3.310240e-322

I am assuming that I am dealing with a value underflow problem. How can I deal with this?

This is for Python 3.6 and a pandas dataframe.

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The values in the dataframe should be standardized as needed, but I am getting this error, presumably due to value underflow.

Coddy
  • 549
  • 4
  • 18
  • I wonder what something on the order of `10^{-320}` represents. Very few things in the universe might get to that smallness – rafaelc Aug 07 '19 at 18:03
  • So I am unwrapping an array which looks something like `array([[[-7.45058060e-09, 3.33333329e-01, 1.00000000e+00],` `[-1.00000000e+00, -3.33333329e-01, 6.27143372e-310],` `[3.31023983e-322, 1.35335972e-315, 6.42285340e-322]]],` `dtype=float128) ` with one column for each value. – Coddy Aug 07 '19 at 18:05
  • If you're dealing with stuff in the range `e-9` to `e-1`, then definitely `3e^-322` is zero – rafaelc Aug 07 '19 at 18:09
  • I second rafaelc. Just out of curiosity, what could this number possibly represent? – tnknepp Aug 07 '19 at 18:13
  • @rafaelc If I only deal with `e^-322` would that be of any help? – Coddy Aug 07 '19 at 18:14
  • these values are for atoms – Coddy Aug 07 '19 at 18:15
  • @Coddy Ok, everything I know about atoms (mass, radius, etc.) is a few hundred orders of magnitude greater than E-320. I think rafaelc's point is that in the array you provided the E-322 values are all zero. The numbers you are talking about are roughly 1E300 times smaller than the expected size of strings in string theory. Your numbers are FAR too small to represent anything physical. – tnknepp Aug 07 '19 at 18:21
  • @Coddy If *all* your data is `something times e^-320`, then just drop the `e^-320` factor. For standard scaling, the magnitude doesn't matter, only your distribution – rafaelc Aug 07 '19 at 18:25
  • @rafaelc I do not know the specific application that Coddy is dealing with, but these numbers are too small to represent anything physical. I recommend looking for errors in his processing and doing a unit check. This seems to be a potentially good example of being able to identify when your data is wrong. – tnknepp Aug 07 '19 at 18:27
  • To be exact, they are simulation feature values of crystal diffraction patterns. – Coddy Aug 07 '19 at 18:28
  • @Coddy That's a little out of my wheelhouse (PhD in analytical chemistry), but for something that small you are a few hundred orders of magnitude below the Heisenberg uncertainty limits. I just don't see how we can measure or reliably calculate anything that small. I apologize for not making suggestions regarding your specific question, but I suggest again to check for errors in your processing and in your units. Anything E-300 is indistinguishable from zero. – tnknepp Aug 07 '19 at 18:34
  • @rafaelc Yeah, now that you said that, I am thinking it might be the case. I am a regular CS guy playing around with datasets. I will have a look at whether these values went haywire a few preprocessing steps ago. Thanks a lot :) – Coddy Aug 07 '19 at 18:49

3 Answers

0

Multiply them.

You're right: your values are too small for pandas to handle reliably as floats. The smallest normal np.float64 value is ~2.22e-308; below that, values become subnormal and rapidly lose precision. You can handle somewhat smaller values by using more obscure types like np.longdouble, but these have their limits too and can be system-dependent.

As some of the comments point out, most plausible use cases don't require values this small. But if yours does, one simple way to get around the float boundaries is to multiply all of your values by a consistent integer that brings them within the acceptable float range (perhaps by 10^320). You're not losing any information, just dropping a long sequence of zeroes.

Note: this only works if you're not simultaneously storing numbers too huge to multiply without breaking the float limits in the other direction. But this seems unlikely.
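A minimal sketch of this approach, using toy data in the subnormal range quoted in the question. One wrinkle: the multiplier itself has to be representable as a float64, so a factor like 1e300 works where a literal 1e320 would already evaluate to inf.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data in the subnormal range quoted in the question.
X_train = np.array([[3.962406e-321], [3.310240e-322], [3.962406e-321]])
X_test = np.array([[3.310240e-322]])

# The factor must itself fit in float64: a literal 1e320 evaluates to inf,
# so pick something like 1e300 that brings the values into normal range.
SCALE = 1e300
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train * SCALE)
X_test_std = scaler.transform(X_test * SCALE)

Since standardization subtracts the mean and divides by the standard deviation, a uniform rescaling cancels out: the standardized output is the same as it would have been without the underflow.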

ASGM
  • 11,051
  • 1
  • 32
  • 53
  • Just a caveat, you'll lose precision anyway even when multiplying. If you have `a = 1.2345` and `C = 1e-322`, then `a * C / C` is not equal to `a` – rafaelc Aug 07 '19 at 18:44
  • That's an important point (though on my computer `a * C/C == a` evaluates to `True`). But it depends on how OP is storing them in the first place, right? If the values are strings in some CSV file and they're doing a numeric conversion, then manipulating the strings as strings (by changing the exponent, for example, before numeric conversion) would fix the issue. – ASGM Aug 07 '19 at 18:52
0

Store the log of the number, and reverse with exp when needed later. If you then need to shift the values, the shift becomes additive (instead of multiplicative). Working in log-space helps avoid machine zero, though you'll still have issues to deal with when operating on the log values, e.g. the log of a sum is not the sum of the logs.
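A minimal sketch of this idea, assuming the values are strictly positive (the log is undefined at zero and below):

import numpy as np

values = np.array([3.962406e-321, 3.310240e-322])

# The log of a subnormal float64 is just a moderate negative number (~ -738),
# so everything downstream stays comfortably inside float64 range.
log_values = np.log(values)

# A multiplicative shift becomes additive in log-space:
# multiplying every value by 1e5 is the same as adding log(1e5) to the logs.
shifted = log_values + np.log(1e5)

# Reverse with exp when the raw values are needed again.
recovered = np.exp(shifted)  # == values * 1e5, up to float rounding

For the log-of-sum caveat, something like scipy.special.logsumexp computes the log of a sum of values stored as logs without leaving log-space.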

-1

You should try normalizing your data to bring it within a manageable scale. Here is some sample code:

from sklearn import preprocessing

x = df.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

You are receiving the NaN error because the numbers fall outside the scale that float64 can handle.

EDIT1: Your error says that your dataset contains NaN values that cannot be converted to float64 type. Are you sure there are no empty values? If so, try to drop those rows with the DataFrame.dropna() function, as in the sketch below.
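A quick sketch of that check, assuming df and min_max_scaler are the same objects as in the snippet above:

# Count missing values per column before scaling.
print(df.isna().sum())

# Drop rows that contain NaN, then retry the scaling.
x_scaled = min_max_scaler.fit_transform(df.dropna().values)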

Jaskaran Singh
  • 531
  • 3
  • 14