3

NB: My question is not a duplicate of Format floats with standard json module. In fact, Mark Dickinson provided a good answer to my question in one of his comments, and this answer is all about pandas.read_csv, which is not even mentioned in that earlier post. Although [pandas] was one of the post's tags from the beginning, I have now edited the title to make the connection with pandas explicit.


As a very minimal example, suppose that I have a file foo.csv with the following content:

foo
-482.044

Now, if I read this file in with pandas.read_csv, and dump a transform of these data using simplejson.dumps I get the following:

simplejson.dumps(pandas.read_csv('/tmp/foo.csv')
                       .to_dict(orient='index')
                       .values()[0])
# '{"foo": -482.04400000000004}'

IOW, the original -482.044 became -482.04400000000004.

NB: I understand why this happens.

What I'm looking for is some convenient way to get around it.

IOW, the desired JSON string in this case is something like

'{"foo": -482.044}'

I'm looking for a convenient way to generate this string, starting from the file foo.csv shown earlier.

Needless to say, this example is unrealistically simple. In practice, foo.csv would contain thousands/millions of rows, and tens/hundreds of columns, not all necessarily floats (or even numeric). I'm only interested in solutions that would work for such real-life data.

Of course, I could avoid floating-point issues altogether by passing dtype=str to pandas.read_csv, but this would not produce the desired result:

simplejson.dumps(pandas.read_csv('/tmp/foo.csv', dtype=str)
                       .to_dict(orient='index')
                       .values()[0])
# '{"foo": "-482.044"}'

To put it in different terms: I want the input CSV to serve as the explicit specification of how to serialize whatever floating point values it contains. Is there a simple/convenient way to achieve this?

kjo
  • 33,683
  • 52
  • 148
  • 265
  • So you want to read mixed data, converting floats to _something else_ (to prevent rounding errors), and then output them in json without the quotes... is that about the sum of it? – TemporalWolf Jun 21 '17 at 18:33
  • @TemporalWolf: that's one way of wording it, as long as the "without the quotes" applies only to those items that by default would have been read in as numeric – kjo Jun 21 '17 at 18:35
  • 2
    I think you're likely going to have to write your own custom class, complete with a [read_csv converter/dtype](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for float to your custom class and [JSONEncoder](https://docs.python.org/2/library/json.html#json.JSONEncoder) to output. – TemporalWolf Jun 21 '17 at 18:52
  • 1
    Thinking more on this, you might be able to get away with some sort of sentinel string value conversion, along the lines of `"THIS_IS_A_FLOAT_STRING_CONVERT_BACK -482.044"` and then overriding the string JSONEncoder, but that is prone to issues if your sentinel value is in the data set somewhere. – TemporalWolf Jun 21 '17 at 19:06
  • Possible duplicate of [Format floats with standard json module](https://stackoverflow.com/questions/1447287/format-floats-with-standard-json-module) – TemporalWolf Jun 21 '17 at 19:18
  • Monkey patching `json.encoder.FLOAT_REPR`, as suggested in the dupe target, may be the simplest answer. – TemporalWolf Jun 21 '17 at 19:19
  • Not familiar with panda, but can't you just read the value as string, somehow (I don't know if panda allows that) and then save it back as string too? After all, this is all just a text to text conversion and ideally, you wouldn't convert to and from a float format at all. If necessary, write your own code to preserve the string. – Rudy Velthuis Jun 21 '17 at 19:33
  • 3
    Have you tried using `pandas.read_csv(..., float_precision="round_trip")`? One of the contributing factors here is that the Pandas CSV float reading sacrifices accuracy for speed; `float_precision = "round_trip"` instructs it to use a more accurate string-to-float conversion algorithm. – Mark Dickinson Jun 22 '17 at 10:37
  • @MarkDickinson: That's the ticket, at least for the example I posted, and all the other spot checks I've tried. Awesome! Thank you! I hope you'll post this as the answer. – kjo Jun 22 '17 at 10:58
  • Have a look at this [Decimal class rounding in Pandas](https://stackoverflow.com/a/54774317/1810559). – denson Jan 12 '20 at 21:29

1 Answers1

0

pandas uses numpy and converts your data -482.044 as float64. But the real set is very dense. Then a set of floats as the same representant, here it is -482.04400000000004. The float -482.044 is rounded to closest representant.

https://en.wikipedia.org/wiki/IEEE_floating_point

Here:

>>> import numpy as np
>>> np.float64(-482.044)
-482.04400000000004
>>> float(-482.044)
-482.044
>>> float(-482.044) == np.float64(-482.044)
True

because numpy float hasn't the same representation than native python float.

You can use that:

def truncate(n, n_digits):
    i, d = str(float(n)).split('.')
    return '.'.join([i, d[:n_digits]])

For your issue:

foo.csv:

foo
-482.044

Python script:

# python3

import simplejson
import pandas

# /!\ if dtype=float here it is float numpy
df = pandas.read_csv('foo.csv', dtype=str) 
# here it is native float python
df['foo'] = df['foo'].apply(float) 

data = simplejson.dumps({'foo': df.values[0][0]})
# data = '{"foo": -482.044}'
glegoux
  • 3,505
  • 15
  • 32
  • I don't see how to put your answer to use given the context I presented in my question. Could you please show what your answer entails in that context? IOW, how would you read the file `foo.csv`, and how would you generate the desired JSON string from the read data? – kjo Jun 21 '17 at 19:17
  • 2
    "float of native python is different than float in numpy." <- No, they use _exactly_ the same format to store the underlying value. The only difference is that they choose to display the values differently under `repr`. `isinstance(np.float64(1.1), float)` -> `True` – Mark Dickinson Jun 22 '17 at 10:47