
Preface

I'll preface this by saying a few things:

  • I'm new(er) to python/pandas, and trying not to reinvent the wheel
  • I've read the documentation for pandas.to_numeric
  • I've read Why Are Floating Point Numbers Inaccurate? and Why can't decimal numbers be represented exactly in binary?. I understand the concepts we're working with here, and am trying to figure out a "good" solution to getting a modicum of precision
  • I know it would be easier to ensure users only enter numbers, but I don't have control over the source data; the program is a large-scale data validation tool which has to accept everything from the user no matter what was provided

Background

I'm working on porting a project from R to python, and am struggling to find a good way to translate a text-to-float conversion process. Specifically, I'm:

  • Reading a pandas dataframe in from a SQL database (pandas.read_sql)
  • Casting a column which is stored as text back into a number using pandas.to_numeric
    • Using the errors='coerce' option to force the coercion from text to floats
    • Checking the before/after to see if/where input was coerced to NaN (a minimal sketch of the whole flow follows this list)
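
A minimal sketch of that flow (the database, table, and column names here are hypothetical):

import pandas as pd
import sqlite3

# Hypothetical database, table, and column names
conn = sqlite3.connect('validation.db')
raw = pd.read_sql('SELECT * FROM submissions', conn)

# Cast the text column to floats; anything unparseable becomes NaN
converted = pd.to_numeric(raw['measurement'], errors='coerce')

# Rows where non-null text failed to convert to a number
coerced = raw.loc[raw['measurement'].notna() & converted.isna(), 'measurement']
print(coerced)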

Question

How do I get pandas to cleanly coerce a text representation of a float value to the corresponding float?

The issue I'm running into is with the numerical precision. For example:

In [1]: pd.to_numeric('3.785', errors='coerce')
Out[1]: 3.7850000000000001

I understand why that's happening (i.e. 3.785 can't be represented exactly in binary floating point). What I'm trying to figure out is whether there's an efficient way to work around it.

For example, in R, there's a lot that goes on under the hood, but in the as.numeric API, you get the number that's ostensibly represented by the text value:

> as.numeric('3.785')
[1] 3.785

which is my desired behavior. Is that possible using pandas/python? I'm open to incorporating other packages, or being told "that's impossible."

    Longer answer: Yes. Make sure you're using NumPy >= 1.14.0. The output number you're seeing here is a NumPy `float64` instance, and NumPy 1.14.0 changed the floating-point printing functions so that you'll see a concise representation much more often. – Mark Dickinson Jan 31 '18 at 16:57
  • Thank you @MarkDickinson! That was exactly what I was looking/hoping for. I really appreciate it. – Adam Bethke Jan 31 '18 at 17:04
  • @cᴏʟᴅsᴘᴇᴇᴅ: I don't think the duplicate is remotely appropriate – Mark Dickinson Feb 02 '18 at 07:35
  • @MarkDickinson Reopened, my mistake. Feel free to answer. – cs95 Feb 02 '18 at 07:43
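
Following up on Mark Dickinson's comment, a quick way to confirm the behavior (output assumes NumPy >= 1.14.0 is installed):

import numpy as np
import pandas as pd

print(np.__version__)  # the concise repr needs >= 1.14.0

x = pd.to_numeric('3.785', errors='coerce')
print(type(x))  # <class 'numpy.float64'>
print(x)        # 3.785 with the newer shortest-repr printing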

1 Answer


I'm not sure if you are asking more about precision or the display (formatting) of precision, but hopefully this answer will address both questions.

Behind the scenes, R and pandas are likely doing the exact same thing, but the default display is making it seem otherwise. For example, if you want a better idea of how R is storing the number, you could do this:

> sprintf( "%.20f", as.numeric('3.785') )
[1] "3.78500000000000014211"

Of course, once you get past the 14th decimal place or so you are exceeding what can be stored in a double precision float, and the digits aren't really meaningful. Still, this should at least be sufficient to show that R is not really storing it as exactly 3.785.
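
The Python side stores the same double, which you can see with analogous formatting; the standard library's Decimal shows the exact stored value:

In [59]: '%.20f' % pd.to_numeric('3.785', errors='coerce')
Out[59]: '3.78500000000000014211'

In [60]: from decimal import Decimal; Decimal(3.785)
Out[60]: Decimal('3.785000000000000142108547152020037174224853515625')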

Admittedly, there is a more rigorous answer than that out there somewhere, but as a practical matter "just ignore anything past the 13th or 14th decimal" is almost always sufficient for standard data work (and you almost certainly are better off spending your time elsewhere than worrying about the 14th decimal place).
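
In code, that rule of thumb usually translates to comparing with a tolerance rather than for exact equality, e.g. via the standard library's math.isclose:

import math

x = float('3.785')

# A difference at the 15th decimal place is representation noise
print(math.isclose(x, x + 1e-15, rel_tol=1e-13))  # True

# A genuine discrepancy is still caught
print(math.isclose(x, 3.7851, rel_tol=1e-13))     # False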

I can't say exactly why the pandas function you used showed 3.7850000000000001 rather than 3.785, but generally speaking you can get different levels of default display precision in different parts of numpy and pandas. For example, if you just put [ and ] around your number (to pass a list rather than a scalar), pd.to_numeric would have output a numpy array and shown you fewer decimal places:

In [61]: pd.to_numeric(['3.785'], errors='coerce')
Out[61]: array([ 3.785])

You would also get similar display output if you passed a Series rather than a list. If you want a particular level of display precision, you just have to specify it explicitly:

In [62]: pd.to_numeric(pd.Series(['3.785']), errors='coerce').map('{:,.20f}'.format)
Out[62]: 
0    3.78500000000000014211
dtype: object

Hence, when you explicitly display 20 decimal places, you see the same numerical representation in both R and pandas/numpy.
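
An alternative to formatting each value is pandas' global display-precision option (the exact digits shown may vary a little by pandas version):

In [63]: pd.set_option('display.precision', 20)

In [64]: pd.to_numeric(pd.Series(['3.785']), errors='coerce')
Out[64]: 
0    3.78500000000000014211
dtype: float64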

It is perhaps worth mentioning as an aside that Python does offer a Decimal type, which features exact decimal arithmetic (unlike either single or double precision floats), but it is rarely used for data science or numerical applications: it is not a native pandas type, and performance is generally going to be poor. I believe it is intended mainly for accounting applications.
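
For completeness, a minimal look (note that Decimal must be constructed from the string, not the float, to stay exact):

In [65]: from decimal import Decimal

In [66]: Decimal('3.785') + Decimal('0.215')  # exact decimal arithmetic
Out[66]: Decimal('4.000')

In [67]: Decimal(3.785)  # constructed from the float, the binary error leaks in
Out[67]: Decimal('3.785000000000000142108547152020037174224853515625')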

JohnE