How to distinguish between floats, ints and scientific notation

Question

I'm writing a custom JSON compressor. It is going to read numbers in all formats. How do I print the values of the JSON in the format in which they are given, using json.load()? I would also like to preserve the type.

An example of a file it would have to read:

{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", "f":"3.01E2"}

I would also like it to be able to distinguish between 1, 1.0 and true.

When I do a basic for loop to print the values and their types with json.load(), it prints:

301 int
301.0 float
301.0 float
301 str
301.0 str
3.01E2 str

And yes, I understand that numbers in scientific notation are floats.

The expected output would be:

301 int
301.0 float
3.01E2 float
301 str
301.0 str
3.01E2 str

It is unclear what you need. You're talking about encoding, then loading (decoding). What do you need? – azro, Dec 10 '22 at 10:49
So what is the expected output, or the expected thing to have? – azro, Dec 10 '22 at 11:00
Expected output would be 301 int 301.0 float 3.01E2 float 301 str 301.0 str 3.01E2 str – user15555955, Dec 10 '22 at 11:02
Python uses scientific notation for floats only if needed. For such a number it's not needed, so it doesn't. – azro, Dec 10 '22 at 11:08
Yes, but the input contains the scientific notation, and I would like to have it in the output – user15555955, Dec 10 '22 at 11:14
So see https://stackoverflow.com/questions/6913532/display-a-decimal-in-scientific-notation but I don't see a way to know that this specific float was in scientific notation in the JSON; those are two separate things – azro, Dec 10 '22 at 11:15
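
The point made in the comments above can be checked with a short sketch. It assumes nothing beyond the standard json module; the sample document and the variable names are made up for illustration. After decoding, 301.0 and 3.01E2 are the same float, and forcing scientific notation on output does not recover the original spelling. The types of 1, 1.0 and true, on the other hand, do survive decoding.

```python
import json

# Hypothetical sample: two spellings of the same number, plus 1, 1.0 and true
parsed = json.loads('{"b": 301.0, "c": 3.01E2, "x": [1, 1.0, true]}')

# Both spellings decode to the same float, so the source text is gone
same_value = parsed["b"] == parsed["c"]

# Forcing scientific notation picks Python's own spelling, not the input's
reformatted = format(parsed["c"], ".2E")

# int, float and bool stay distinct types after decoding
type_names = [type(v).__name__ for v in parsed["x"]]

print(same_value, repr(parsed["c"]), reformatted, type_names)
```

Here `reformatted` is `3.01E+02` rather than the `3.01E2` found in the input, which is why the original text has to be captured at parse time if it is needed later.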

Tranbi · Accepted Answer · 2022-12-10T14:23:59.837

So, if I understand correctly, you want to keep the formatting of the JSON even if the value is given as a float. One way to do this is to change the type in your JSON text, i.e. to add quotes around the numeric values before parsing.

This can be done with regex:

import json
import re

import pandas as pd  # only used for pretty formatting of the result

data = """{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", "f":"3.01E2", "g": true, "h":"hello"}"""

# the crucial part: enclosing float/int values in quotes
pattern = re.compile(r'(?<=:)\s*([+-]?\d+(?:\.\d*(?:E-?\d+)?)?)\b')
data_str = pattern.sub(r'"\1"', data)

val_dict = json.loads(data)  # the values as normally read by the json module
type_dict = {k: type(v) for k, v in val_dict.items()}  # their types
repr_dict = json.loads(data_str)  # the representations (every number is a string there)

# one row per key, one column per view of the data
df = pd.DataFrame([val_dict, type_dict, repr_dict], index=["Value", "Type", "Repr."]).T
print(df)

Output:

    Value             Type   Repr.
a     301    <class 'int'>     301
b   301.0  <class 'float'>   301.0
c   301.0  <class 'float'>  3.01E2
d     301    <class 'str'>     301
e   301.0    <class 'str'>   301.0
f  3.01E2    <class 'str'>  3.01E2
g    True   <class 'bool'>    True
h   hello    <class 'str'>   hello

Here are the details of the regex:

([+-]?\d+(?:\.\d*(?:E-?\d+)?)?) is our capturing group, consisting of:
- [+-]? an optional leading + or -
- \d+ one or more digits, optionally followed by:
- (?:\.\d*(?:E-?\d+)?)? a non-capturing group made of
  - \. a dot
  - \d* zero or more digits
  - (optionally) an uppercase E with an (optional) minus -, followed by one or more digits \d+
\b specifies a word boundary (so that the match doesn't cut a series of digits)
(?<=:) is a lookbehind, ensuring the match starts directly after a : (we don't add quotes around existing strings)
\s* matches any whitespace before the number; it is part of the match, so it is removed by the replacement

\1 is a back reference to our (1st) group, so we replace the whole match with "\1", i.e. the number wrapped in quotes

Edit: slightly changed the regex to replace numbers directly following : and taking leading +/- into account
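
The pattern's behaviour on a few edge cases can be checked with a small sketch. The pattern is copied from the code above; the helper names `quote_numbers` and `is_valid_json` and the sample inputs are assumptions made for illustration, not part of the original answer.

```python
import json
import re

# Pattern copied unchanged from the answer
pattern = re.compile(r'(?<=:)\s*([+-]?\d+(?:\.\d*(?:E-?\d+)?)?)\b')


def quote_numbers(text):
    """Wrap every number that directly follows a colon in double quotes."""
    return pattern.sub(r'"\1"', text)


def is_valid_json(text):
    """Return True if the json module can decode the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


samples = [
    '{"a":-301, "c":3.01E2}',  # signed int and uppercase exponent
    '{"a":[1, 2.5]}',          # numbers inside an array
    '{"a":"ratio 3:4"}',       # colon followed by a digit inside a string
]

for sample in samples:
    quoted = quote_numbers(sample)
    print(sample, "->", quoted, "| valid JSON:", is_valid_json(quoted))
```

In this sketch the first sample is quoted as intended, the array elements are left untouched because they do not follow a colon, and the third sample is turned into invalid JSON because the pattern cannot tell that the colon sits inside a string.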

edited Dec 10 '22 at 14:23

answered Dec 10 '22 at 12:57

Tranbi

Would changing (\d+(?:\.\d*(?:E-?\d+)?)?) to (?:\+?:\-\d+(?:\.\d*(?:E-?\d+)?)?) make it also work with -300 and +300 as inputs? – user15555955 Dec 10 '22 at 14:09
I edited my answer with a slightly more robust regex – Tranbi Dec 10 '22 at 14:24
Thanks for all the help. One thing I did notice was that if I have :4 or another number after a colon inside a string, it fails. I experimented a bit and found that this https://regex101.com/r/nsP2zI/1 seems to work pretty well at getting the keys with lists and special characters as values. However, I couldn't figure out how to integrate this with your answer and have it choose the values, as I didn't quite understand how the lookbehind works – user15555955 Dec 11 '22 at 19:24
There are some redundancies in your regex, besides it matches more than the keys (some list elements as well). I think something like `(?:\{|,)\s*\"[^\"]*\":` would suffice. However lookbehinds must have fixed width in Python. So you cannot include this in it. You could check if the expression is followed by a comma by adding `\s*(?=,)` at the end of the pattern – Tranbi Dec 12 '22 at 09:51
Is there a way to count the commas (without backslashes) and check if they have a : after them? Because only checking the characters afterwards can lead to accidental manipulation. That did fix the ":4", but now ":4," is broken. The only option that I could think of is somehow keeping a count of preceding quotation marks – user15555955 Dec 13 '22 at 11:04
Regex is limited. If you want to exclude all possible errors, you'd be better off writing your own JSON parser that will read floats as strings – Tranbi Dec 13 '22 at 11:49
Welp, I just figured out that I can replace the regex parsing with "parse_float=str, parse_int=str" as parameters of json.loads. Thanks for the help though – user15555955 Dec 13 '22 at 12:35
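
The approach from the last comment can be sketched as follows. json.loads calls parse_int and parse_float with the exact source text of each number, so passing str keeps the original spelling. With plain str, however, the number 301 and the string "301" become indistinguishable. The sketch below works around that with a hypothetical str subclass, `NumberLiteral`; that class and the helpers `load_preserving_numbers` and `describe` are assumptions for illustration, not something from the thread.

```python
import json


class NumberLiteral(str):
    """Hypothetical marker type: a JSON number kept as its source text."""


def load_preserving_numbers(text):
    # parse_int / parse_float receive the exact substring of each JSON number
    return json.loads(text, parse_int=NumberLiteral, parse_float=NumberLiteral)


def describe(value):
    """Return a type name that follows the JSON source, not the Python value."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, NumberLiteral):
        # A JSON integer has neither a fraction nor an exponent part
        return "float" if any(c in value for c in ".eE") else "int"
    if isinstance(value, str):
        return "str"
    return type(value).__name__


data = ('{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", '
        '"f":"3.01E2", "g":true, "h":1, "i":1.0}')
decoded = load_preserving_numbers(data)
for key, value in decoded.items():
    print(key, value, describe(value))
```

For key c this prints `c 3.01E2 float`, and for key f it prints `f 3.01E2 str`, which matches the expected output in the question. One caveat: json.dumps would write a `NumberLiteral` back as a quoted string, so re-encoding the data needs separate handling.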
