
I'm writing a custom JSON compressor that has to read numbers in all formats. How do I print the values of the JSON in the format in which they were given, using json.load()? I would also like to preserve the type.

Example of a file it would have to read would be:

{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", "f":"3.01E2"}

I would also want it to be able to distinguish between 1, 1.0 and true.

When I do a basic for loop to print the values and their types with json.load(), it prints out

301 int
301.0 float
301.0 float
301 str
301.0 str
3.01E2 str

And yes, I understand that numbers in scientific notation are floats.
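
For reference, a minimal sketch of the kind of loop described above. The exact loop is not shown in the question, so this is an assumption; it uses json.loads on an inline string instead of json.load on a file, and `value_and_type_lines` is a hypothetical helper name.

```python
import json

raw = '{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", "f":"3.01E2"}'


def value_and_type_lines(text):
    """Return one 'value type' line per JSON value, as the default decoder sees it."""
    return [f"{value} {type(value).__name__}" for value in json.loads(text).values()]


# The float 3.01E2 is converted to 301.0 on load, so its original notation is lost.
for line in value_and_type_lines(raw):
    print(line)
```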

Expected output would be

301 int
301.0 float
3.01E2 float
301 str
301.0 str
3.01E2 str
  • It is unclear what you need: you're talking about encoding, then loading (decoding). What do you need? – azro Dec 10 '22 at 10:49
  • Sorry about that, had a brain fart. I mean a loader. – user15555955 Dec 10 '22 at 10:51
  • So what is the expected output, or the expected thing to have? – azro Dec 10 '22 at 11:00
  • Expected output would be 301 int 301.0 float 3.01E2 float 301 str 301.0 str 3.01E2 str – user15555955 Dec 10 '22 at 11:02
  • Python uses scientific notation for floats only if needed; for such a number it's not needed, so it doesn't – azro Dec 10 '22 at 11:08
  • Yes, but the input contains the scientific notation, and I would like to have it in the output – user15555955 Dec 10 '22 at 11:14
  • So see https://stackoverflow.com/questions/6913532/display-a-decimal-in-scientific-notation but I don't see a way to know that this specific float was in scientific notation in the JSON; those are two separate things – azro Dec 10 '22 at 11:15

1 Answer

So IIUC you want to keep the formatting of the JSON even when the value is given as a float. I think the only way to do this is to change the type in your JSON, i.e. to add quotes around the numeric elements.

This can be done with regex:

import json
import re

import pandas as pd  # only used for pretty formatting

data = """{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", "f":"3.01E2", "g": true, "h":"hello"}"""

# the crucial part: enclosing float/int in quotes
pattern = re.compile(r'(?<=:)\s*([+-]?\d+(?:\.\d*(?:E-?\d+)?)?)\b')
data_str = pattern.sub(r'"\1"', data)

val_dict = json.loads(data)  # the values as normally read by the json module
type_dict = {k: type(v) for k, v in val_dict.items()}  # their types
repr_dict = json.loads(data_str)  # the representations (every number is a string there)

# using pandas for pretty formatting
df = pd.DataFrame([val_dict, type_dict, repr_dict], index=["Value", "Type", "Repr."]).T
print(df)

Output:

    Value             Type   Repr.
a     301    <class 'int'>     301
b   301.0  <class 'float'>   301.0
c   301.0  <class 'float'>  3.01E2
d     301    <class 'str'>     301
e   301.0    <class 'str'>   301.0
f  3.01E2    <class 'str'>  3.01E2
g    True   <class 'bool'>    True
h   hello    <class 'str'>   hello

Here are the details of the regex:

  • ([+-]?\d+(?:\.\d*(?:E-?\d+)?)?) this is our matching group, consisting of:
    • [+-]? optional leading + or -
    • \d+ one or more digits, followed by (optionally):
    • (?:\.\d*(?:E-?\d+)?)?: non capturing group made of
      • \. a dot
      • \d* zero or more digits
      • (optionally) an E with an (optional) minus - followed by one or more digits \d+
  • \b specifies a word boundary (so that the match doesn't cut a series of digits)
  • (?<=:) is a lookbehind, ensuring the expression is directly preceded by : (we don't add quotes around existing strings)
  • \s* any whitespace before the expression is matched and removed

\1 is a back reference to our (1st) group. So we replace the whole match with "\1"

Edit: slightly changed the regex so that it replaces numbers directly following : and takes a leading +/- into account
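
To see the substitution in isolation, here is a small sketch that applies the same pattern to two inputs. `quote_numbers` is a hypothetical helper name introduced only for this illustration. The second input shows a limitation worth knowing: the lookbehind only checks for a preceding colon, so a colon followed by a digit inside a string value is matched too, and the result is no longer valid JSON.

```python
import json
import re

# Same pattern as in the answer above.
pattern = re.compile(r'(?<=:)\s*([+-]?\d+(?:\.\d*(?:E-?\d+)?)?)\b')


def quote_numbers(text):
    """Wrap every number that directly follows a colon in double quotes."""
    return pattern.sub(r'"\1"', text)


# Numbers after a key are quoted; whitespace between ':' and the number is dropped.
print(quote_numbers('{"a":301, "b": -3.01E2}'))

# Pitfall: ':4' inside a string value is matched as well.
broken = quote_numbers('{"k":"ratio:4"}')
print(broken)
try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    print("invalid JSON after substitution:", exc)
```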

Tranbi
  • Would changing (\d+(?:\.\d*(?:E-?\d+)?)?) to (?:\+?:\-\d+(?:\.\d*(?:E-?\d+)?)?) make it also work with -300 and +300 as inputs? – user15555955 Dec 10 '22 at 14:09
  • I edited my answer with a slightly more robust regex – Tranbi Dec 10 '22 at 14:24
  • Thanks for all the help. One thing I did notice was that if I have :4 or another number after a colon inside a string, it fails. I experimented a bit and found that this [https://regex101.com/r/nsP2zI/1](https://regex101.com/r/nsP2zI/1) seems to work pretty well at getting the keys with lists and special characters as values. However, I couldn't figure out how to integrate this with your answer and have it choose the values, as I didn't quite understand how the lookbehind works – user15555955 Dec 11 '22 at 19:24
  • There are some redundancies in your regex, besides it matches more than the keys (some list elements as well). I think something like `(?:\{|,)\s*\"[^\"]*\":` would suffice. However lookbehinds must have fixed width in Python. So you cannot include this in it. You could check if the expression is followed by a comma by adding `\s*(?=,)` at the end of the pattern – Tranbi Dec 12 '22 at 09:51
  • Is there a way to count the commas (without backslashes) and check if they have a : after them? Because only checking the characters afterwards can lead to accidental manipulation. That did fix the ":4", but now ":4," is broken. The only option that I could think of is somehow keeping a count of preceding quotation marks – user15555955 Dec 13 '22 at 11:04
  • Regex is limited. If you want to exclude all possible errors, you'd be better off writing your own JSON parser that will read floats as strings – Tranbi Dec 13 '22 at 11:49
  • Welp, I just figured out that I can replace the regex parsing by passing parse_float=str, parse_int=str as parameters to json.loads. Thanks for the help though – user15555955 Dec 13 '22 at 12:35
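
The approach from the last comment can be sketched as follows. With plain parse_float=str and parse_int=str, numbers and genuine strings both come back as str, so this sketch goes one step further and uses two small str subclasses as markers. `RawInt`, `RawFloat`, `load_preserving` and `describe` are hypothetical names introduced for this illustration, not part of the json module; only the parse_int and parse_float hooks of json.loads are standard.

```python
import json


class RawInt(str):
    """Hypothetical marker type: a JSON integer literal kept as its source text."""


class RawFloat(str):
    """Hypothetical marker type: a JSON float literal kept as its source text."""


def load_preserving(text):
    """Decode JSON, keeping every number as the exact text found in the input."""
    return json.loads(text, parse_int=RawInt, parse_float=RawFloat)


def describe(value):
    """Name the JSON-level type of a decoded value."""
    if isinstance(value, RawInt):
        return "int"
    if isinstance(value, RawFloat):
        return "float"
    return type(value).__name__


data = ('{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", '
        '"f":"3.01E2", "g":true, "h":1, "i":1.0}')

for key, value in load_preserving(data).items():
    print(key, value, describe(value))
```

Because the decoder hands the raw numeric text to these hooks, 3.01E2 keeps its notation, and 1, 1.0 and true stay distinguishable as int, float and bool. Call int(value) or float(value) whenever the numeric value itself is needed.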