Preface
I'll preface this by saying two things:
- I'm new(er) to python/pandas, and trying not to reinvent the wheel
- I've read the documentation for pandas.to_numeric
- I've read Why Are Floating Point Numbers Inaccurate? and Why can't decimal numbers be represented exactly in binary?. I understand the concepts we're working with here, and am trying to figure out a "good" solution to getting a modicum of precision
- I know it would be easier to ensure users only enter numbers, but I don't have control over the source data; the program is a large-scale data validation tool which has to accept everything from the user no matter what was provided
Background
I'm working on porting a project from R to python, and am struggling to find a good way to translate a text-to-float conversion process. Specifically, I'm:
- Reading a pandas dataframe in from a SQL database (pandas.read_sql)
- Casting a column which is stored as text back into a number using pd_to_numeric
- Using the
error='coerce'
option to force the coercion from text to floats - Checking the before/after to see if/where input was coerced to NaN
- Using the
Question
How to I get pandas to neatly coerce a text representation of a float value to the corresponding float cleanly?
The issue I'm running into is with the numerical precision. For example:
In[1]: pd.to_numeric('3.785', errors='coerce')
Out[2]: 3.7850000000000001
I understand why that's happening (i.e. 3.785 is represented in a way that's not easily presentable in the underlying binary representation). What I'm trying to figure out is if there's a way to work around why it's happening efficiently.
For example, in R, there's a lot that goes on under the hood, but in the as.numeric API, you get the number that's ostensibly represented by the text value:
> as.numeric('3.785')
[1] 3.785
which is my desired behavior. Is that possible using pandas/python? I'm open to incorporating other packages, or being told "that's impossible."