I have a wrongly-formatted JSON file where I have numbers with leading zeroes.

import json

p = """[
{
    "name": "Alice",
    "RegisterNumber": 911100020001
},
{
    "name": "Bob",
    "RegisterNumber": 000111110300
}
]"""
arc = json.loads(p)

I get this error.

JSONDecodeError: Expecting ',' delimiter: line 8 column 24 (char 107)

Here's what is on char 107:

print(p[107])
#0

The problem is that this is the data I have. I am only showing two records here, but my file has millions of lines to parse, so I need a script. In the end, I need this string:

"""[
{
    "name": "Alice",
    "RegisterNumber": "911100020001"
},
{
    "name": "Bob",
    "RegisterNumber": "000111110300"
}
]"""

How can I do it?

lispguy
  • Error message points you where error is, leading zeroes isn't an issue. – Olvin Roght Jun 09 '20 at 21:03
  • Hint: expecting comma delimiter, per the error message. – jarmod Jun 09 '20 at 21:04
  • I updated the code with a new runnable code where the error persists. – lispguy Jun 09 '20 at 21:05
  • @lispguy, you forgot to mention the new error message. – Olvin Roght Jun 09 '20 at 21:07
  • I didn't. I updated the error message. – lispguy Jun 09 '20 at 21:08
  • Indeed, the comma isn't the issue, easy to test as I did. Leading zeros are not valid json it seems, see also here: https://stackoverflow.com/questions/27361565/why-is-json-invalid-if-an-integer-begins-with-a-leading-zero – Dr. V Jun 09 '20 at 21:13
  • @lispguy care to explain where you got that invalid json? Because may be it would be easier to fix the serializer – Loïc Faure-Lacroix Jun 09 '20 at 21:13
  • It's a legacy resource in a .json file on a Java library I am maintaining. Awkwardly, Java never complained. I need to add more elements to the array, with some logic (I can't have duplicates, once I have parsed this JSON, I can write a little piece of code to eliminate duplicates). – lispguy Jun 09 '20 at 21:16
  • Wow, this was even dangerous to parse in java, as it could have been interpreted as octal numbers (due to leading zeros). Then interpreting this as decimal numbers and simply padding with zeros would have messed up the register-numbers. – Dr. V Jun 09 '20 at 21:32
  • As @Dr.V said, a leading 0 can be interpreted as octal. It doesn't look like valid JSON, because even in JavaScript it would fail to parse. And since any sane library would output numbers without leading 0s, it smells like the Java library isn't dumping JSON at all but is manually serializing it with prints, formatting the int with leading 0s without quoting it. Padded 0s simply have no meaning for an int, so there's no way a parser could load it and dump it in the same format. – Loïc Faure-Lacroix Jun 10 '20 at 02:08

3 Answers


Read the file (ideally line by line) and replace each numeric value with its string representation. You can use regular expressions for that (the re module). Then save the result and parse the now-valid JSON later.

If the file fits into memory, you don't need to save it, of course; just call json.loads on the now-valid JSON string.

Here is a simple version:

import json

p = """[
{
    "name": "Alice",
    "RegisterNumber": 911100020001
},
{
    "name": "Bob",
    "RegisterNumber": 000111110300
}
]"""

import re

# Wrap every run of 12 digits in double quotes.
p = re.sub(r"(\d{12})", r'"\1"', p)

arc = json.loads(p)
print(arc[1])
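
For a file too large to edit by hand, the line-by-line variant described above might look roughly like the following. This is only a sketch: the helper name quote_register_numbers and the file names broken.json and fixed.json are made up for illustration, and it assumes every affected value belongs to a "RegisterNumber" key and sits on a single line.

```python
import json
import re
import tempfile
from pathlib import Path

# Matches a "RegisterNumber" key followed by an unquoted run of digits.
# Values that are already quoted are not matched, so reruns are harmless.
PATTERN = re.compile(r'("RegisterNumber"\s*:\s*)(\d+)')


def quote_register_numbers(src: Path, dst: Path) -> None:
    """Copy src to dst line by line, quoting RegisterNumber values."""
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(PATTERN.sub(r'\1"\2"', line))


# Small demonstration on a temporary file (hypothetical file names).
raw = """[
{
    "name": "Alice",
    "RegisterNumber": 911100020001
},
{
    "name": "Bob",
    "RegisterNumber": 000111110300
}
]"""

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "broken.json"
    dst = Path(tmp) / "fixed.json"
    src.write_text(raw, encoding="utf-8")
    quote_register_numbers(src, dst)
    fixed_text = dst.read_text(encoding="utf-8")

records = json.loads(fixed_text)
print(records[1]["RegisterNumber"])
```

Only one line is held in memory at a time, so the size of the input file should not matter; the final json.loads call is there only to check the small demo.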
Dr. V
  • What if a key name contains a 12-digit number? – Olvin Roght Jun 09 '20 at 21:27
  • My solution assumes input as given by lispguy. The answer of Loïc Faure-Lacroix is better for a more general case where 12-digit numbers can occur other places. – Dr. V Jun 09 '20 at 21:29

This won't be pretty, but you could probably fix it with a regex.

import json
import re

p = "..."
# \s* tolerates any amount of whitespace after the colon.
fixed = re.sub(r'"RegisterNumber":\s*([0-9]+)', r'"RegisterNumber": "\1"', p)
json.loads(fixed)

This will match every case where the RegisterNumber key is followed by a number.
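
If more than one field is affected, the same idea can be wrapped in a small helper that takes the key name as a parameter. The function name quote_numeric_field and the sample record are hypothetical, and the sketch assumes the key never appears inside another string value.

```python
import json
import re


def quote_numeric_field(text: str, key: str) -> str:
    """Wrap unquoted digit runs that follow the given JSON key in quotes."""
    # re.escape guards against keys containing regex metacharacters.
    pattern = re.compile(r'("' + re.escape(key) + r'"\s*:\s*)(\d+)')
    return pattern.sub(r'\1"\2"', text)


# Hypothetical one-line record; other numeric fields are left alone.
broken = '{"name": "Bob", "RegisterNumber":000111110300, "age": 42}'
fixed = quote_numeric_field(broken, "RegisterNumber")
record = json.loads(fixed)
print(record)
```

Because the pattern requires a digit directly after the colon, values that are already quoted are skipped, so applying the helper twice should leave the text unchanged.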

Loïc Faure-Lacroix

Since the problem is the leading zeroes, the easy way to fix the data is to split it into lines and fix any lines that exhibit the problem. It's cheap and nasty, but it seems to work.

import json

data = """[
{
    "name": "Alice",
    "RegisterNumber": 911100020001
},
{
    "name": "Bob",
    "RegisterNumber": 000111110300
}
]"""
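result = []
for line in data.splitlines():
    if ': 0' in line:
        # Strip the leading zeroes one at a time (note: this changes the value).
        while ": 0" in line:
            line = line.replace(': 0', ': ')
        # Wrap what is left of the number in double quotes.
        result.append(line.replace(': ', ': "')+'"')
    else:
        result.append(line)
# Line breaks are dropped here, which JSON does not mind.
data = "".join(result)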

arc = json.loads(data)
print(arc)
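
If the leading zeroes have to survive, as in the desired output shown in the question, a line-based variant without regular expressions might look like this. It is a sketch only: the helper name quote_value is made up, and it assumes each key/value pair sits on its own line with ": " between key and value.

```python
import json


def quote_value(line: str) -> str:
    """Quote a bare all-digit value on a '"key": value' line, keeping zeroes."""
    key, sep, value = line.partition(": ")
    stripped = value.rstrip()
    body = stripped.rstrip(",")
    if not sep or not body.isdigit():
        return line  # not a bare numeric value; leave the line untouched
    trailing = "," if stripped.endswith(",") else ""
    return f'{key}{sep}"{body}"{trailing}'


data = """[
{
    "name": "Alice",
    "RegisterNumber": 911100020001
},
{
    "name": "Bob",
    "RegisterNumber": 000111110300
}
]"""

# Rebuild the text with the original line breaks intact.
fixed = "\n".join(quote_value(line) for line in data.splitlines())
arc = json.loads(fixed)
print(arc[1]["RegisterNumber"])
```

Unlike the loop above, this quotes every all-digit value rather than only those starting with a zero, which matches the target string in the question but may be too broad for files with genuine numeric fields.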
holdenweb
  • That's only good if you know ahead of time that numbers have a predefined amount of chars. Otherwise you could have a number 0011 being different from 00011. It's probably unlikely since 0 are often used as padding.. but who knows? – Loïc Faure-Lacroix Jun 10 '20 at 02:10
  • I refuse to accept any criticism of code I described in the post as "cheap and nasty" ;-) – holdenweb Jun 10 '20 at 15:42
  • Oh no, I'm not criticizing. The real problem here is a JSON serializer that outputs invalid JSON. – Loïc Faure-Lacroix Jun 10 '20 at 22:01