36

From the 2gis API I got the following JSON string.

{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva,   172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center "ADVANCE"",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}

But Python doesn't recognize it:

json.loads(firm_str)

Expecting , delimiter: line 1 column 3646 (char 3645)

It looks like a problem with quotes in: "title": "Center "ADVANCE""

How can I fix it automatically in Python?

eandersson
Anton Barycheuski
  • 4
    This is an encoding issue, not a JSON issue. – Gijs Aug 29 '13 at 15:23
  • encoding is right. don't pay attention to the strange characters – Anton Barycheuski Aug 29 '13 at 15:26
  • Can you isolate this to a specific, small example? Remove pieces until you are left with the bit that breaks. – Joe Aug 29 '13 at 15:26
  • 1
    I think the problem is that there are double quotes in a string delimited with double quotes. Try `"title": "bla 'ADVANCE'"` or `"title": 'bla "ADVANCE"'` instead. It should be possible to build a regex to find those... – tobias_k Aug 29 '13 at 15:27
  • @tobias_k, what happens if there is a comma inside those quotes? Now it may become ambiguous – John La Rooy Aug 29 '13 at 15:30
  • @tobias_k, what regexp can you suggest? Note that such unescaped quotes can appear in any text value – Anton Barycheuski Aug 29 '13 at 15:33
  • @gnibbler: Good point, I didn't think of that. This definitely makes it more complicated, if not impossible, particularly if it's all on one line... Anton, is there some pattern to this? Is it always in the "title" attribute, or always "ADVANCE", or something like this? – tobias_k Aug 29 '13 at 15:35
  • @tobias_k, I think there is no single common pattern. The situation gnibbler describes may occur too, but rarely. – Anton Barycheuski Aug 29 '13 at 15:40
  • This is no regex, and it's a bit tricky, but you could do something like this: Count the `"`; after two `"`, there should be a colon, then another `"`; after the next `"` there should be a comma and another `"` (maybe with some whitespace in between); _if not_, escape that `"` and continue, else repeat. This might still fail, but it's a start... – tobias_k Aug 29 '13 at 15:40
  • 4
    Consider what will happen if a later version of the API fixes this bug. Make sure whatever workaround you use won't cause a new bug in your code when they fix theirs. – John La Rooy Aug 29 '13 at 15:58
  • I'm not sure a perfect solution is possible. Consider the following title: `Center "},"warning":"spoofed_value" dummy: {"dummy": "dummy` (with some newlines added in). Any checker would have to track back quite far to determine what to escape. If the JSON contains several values like these, it would not be able to work its way back starting from the end either. – r.m Nov 23 '13 at 00:30
  • @AntonBarycheuski is the JSON response exactly as you posted (with a newline after each key-value pair) ? If so please consider my answer: I posted a function that just fixes the unescaped strings (avoiding the parse-fix-parse-fix--.. potentially infinite loop in the current accepted answer) – Paolo Nov 24 '13 at 08:34

11 Answers11

46

The answer by @Michael gave me an idea... not a very pretty idea, but it seems to work, at least on your example: Try to parse the JSON string, and if it fails, look for the character where it failed in the exception string (see footnote 1) and replace that character.

import json
import re

# s holds the malformed JSON text
while True:
    try:
        result = json.loads(s)   # try to parse...
        break                    # parsing worked -> exit loop
    except ValueError as e:
        # "Expecting , delimiter: line 34 column 54 (char 1158)"
        # position of unexpected character after '"'
        unexp = int(re.findall(r'\(char (\d+)\)', str(e))[0])
        # position of unescaped '"' before that
        unesc = s.rfind('"', 0, unexp)
        s = s[:unesc] + r'\"' + s[unesc+1:]
        # position of corresponding closing '"' (+2 for inserted '\')
        closg = s.find('"', unesc + 2)
        s = s[:closg] + r'\"' + s[closg+1:]
print(result)

You may want to add some additional checks to prevent this from ending in an infinite loop (e.g., allow at most as many repetitions as there are characters in the string). Also, this will still not work if an incorrect " is actually followed by a comma, as pointed out by @gnibbler.

Update: This seems to work pretty well now (though still not perfect), even if the unescaped " is followed by a comma, or closing bracket, as in this case it will likely get a complaint about a syntax error after that (expected property name, etc.) and trace back to the last ". It also automatically escapes the corresponding closing " (assuming there is one).


1) The exception's str is "Expecting , delimiter: line XXX column YYY (char ZZZ)", where ZZZ is the position in the string where the error occurred. Note, though, that this message may depend on the version of Python, the json module, the OS, or the locale, and thus this solution may have to be adapted accordingly.
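
On Python 3.5 and later, the error position is available directly as the `pos` attribute of `json.JSONDecodeError`, so the message text does not need to be parsed. Below is a minimal sketch of the same idea with a bounded number of attempts. The function name `parse_with_quote_repair` and the `max_tries` parameter are illustrative choices, not part of any library, and the heuristic has the same limitations as the loop above.

```python
import json


def parse_with_quote_repair(text, max_tries=100):
    """Parse text as JSON, escaping one suspect pair of quotes per failure."""
    for _ in range(max_tries):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            # err.pos is the index at which the parser gave up.
            unesc = text.rfind('"', 0, err.pos)
            if unesc == -1:
                raise  # no quote before the error: nothing to repair
            text = text[:unesc] + '\\"' + text[unesc + 1:]
            # Matching closing quote (+2 skips the inserted backslash and quote).
            closing = text.find('"', unesc + 2)
            if closing == -1:
                raise
            text = text[:closing] + '\\"' + text[closing + 1:]
    raise ValueError("gave up after %d repair attempts" % max_tries)


bad = '{"title": "Center "ADVANCE"", "text": "Business.English."}'
print(parse_with_quote_repair(bad)["title"])  # Center "ADVANCE"
```

Valid input is returned on the first attempt, so the wrapper should keep working if the API is fixed later.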

tobias_k
  • Awesome. This doesn't fix all my issues, but it is a good start. – eandersson Nov 16 '13 at 21:53
  • @eandersson Could you be more specific what you expect the "more complete solution" to look like? What are the cases where my solution does not work? (I am sure there are many, but which ones are relevant for you?) – tobias_k Nov 17 '13 at 13:41
  • I was primarily referring to a library to handle these type of oddities, but also a more flexible solution to fix extreme case issues that this can't handle. An example would be a string with a single extra double quote. – eandersson Nov 17 '13 at 15:09
  • btw really I just wanted to see if there was something else out there, otherwise I'll just award the bonus points to this answer. – eandersson Nov 17 '13 at 15:16
  • As of Python 3.5 you can use [JSONDecodeError.pos](https://docs.python.org/3/library/json.html#json.JSONDecodeError.pos) to get the position. – Capi Etheriel May 25 '19 at 21:02
  • If there's another type of issue in the JSON this will run indefinitely, I wouldn't suggest using this without keeping track of how many iterations it does and stopping at some threshold. – Tomer Gal Apr 08 '23 at 15:16
8

If this is exactly what the API is returning then there is a problem with their API. This is invalid JSON. Especially around this area:

"ads": {
            "sponsored_article": {
                "title": "Образовательный центр "ADVANCE"", <-- here
                "text": "Бизнес.Риторика.Английский язык.Подготовка к школе.Подготовка к ЕГЭ."
            },
            "warning": null
        }

The double quotes around ADVANCE are not escaped. You can tell by using something like http://jsonlint.com/ to validate it.
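
For comparison, here is a minimal sketch of what a correct producer emits: Python's own `json.dumps` escapes embedded double quotes, and the result round-trips through `json.loads`. The `record` dict is an illustrative stand-in, not actual 2gis data.

```python
import json

# Illustrative record (an assumption, not real API output).
record = {"title": 'Center "ADVANCE"'}

encoded = json.dumps(record)
print(encoded)  # {"title": "Center \"ADVANCE\""}

# The escaped form parses back to the original value.
assert json.loads(encoded) == record
```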

The problem is that the " characters are not escaped. If this is what you are getting, the data is bad at the source, and they need to fix it.

Parse error on line 4:
...азовательный центр "ADVANCE"",         
-----------------------^
Expecting '}', ':', ',', ']'

This fixes the problem:

"title": "Образовательный центр \"ADVANCE\"",
atorres757
6

The only real and definitive solution is to ask 2gis to fix their API.

In the meantime, it is possible to repair the malformed JSON by escaping the double quotes inside strings. If every key-value pair is followed by a newline (as seems to be the case in the posted data), the following function will do the job:

def fixjson(badjson):
    s = badjson
    idx = 0
    while True:
        # A string value starts right after the sequence '": "'
        start = s.find('": "', idx)
        if start == -1:
            return s
        start += 4
        # ...and ends before the nearest '",\n' or '"\n'
        ends = [e for e in (s.find('",\n', start), s.find('"\n', start)) if e != -1]
        if not ends:
            return s
        end = min(ends)
        # Escape every double quote inside the value
        content = s[start:end].replace('"', '\\"')
        s = s[:start] + content + s[end:]
        idx = start + len(content)

Please note the assumptions made:

The function attempts to escape double-quote characters inside the string values of key-value pairs.

It is assumed that the text to be escaped begins after the sequence

": "

and ends before the sequence

",\n

or

"\n

Passing the posted JSON to the function results in this returned value

{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva,   172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center \"ADVANCE\"",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}

Keep in mind you can easily customize the function if your needs are not fully satisfied.

Paolo
5

The idea above is good, but it did not work for me: my JSON string contained only one additional double quote. So I modified the code given above.

The jsonStr was

{
    "api_version": "1.3",
    "response_code": "200",
    "id": "3237490513229753",
    "lon": "38.969916127827",
    "lat": "45.069889625267",
    "page_url": null,
    "name": "ATB",
    "firm_group": {
        "id": "3237499103085728",
        "count": "1"
    },
    "city_name": "Krasnodar",
    "city_id": "3237585002430511",
    "address": "Turgeneva,   172/1",
    "create_time": "2008-07-22 10:02:04 07",
    "modification_time": "2013-08-09 20:04:36 07",
    "see_also": [
        {
            "id": "3237491513434577",
            "lon": 38.973110606808,
            "lat": 45.029031222211,
            "name": "Advance",
            "hash": "5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e",
            "ads": {
                "sponsored_article": {
                    "title": "Center "ADVANCE",
                    "text": "Business.English."
                },
                "warning": null
            }
        }
    ]
}

The fix is as follows:

import json, re
def fixJSON(jsonStr):
    # Strip all backslashes from the JSON string.
    jsonStr = re.sub(r'\\', '', jsonStr)
    try:
        return json.loads(jsonStr)
    except ValueError:
        while True:
            # Search for a '"' sitting between word characters or quotes
            b = re.search(r'[\w|"]\s?(")\s?[\w|"]', jsonStr)

            # If we don't find any, we leave the loop
            if not b:
                break

            # Get the location of the stray "
            s, e = b.span(1)
            c = jsonStr[s:e]

            # Replace " with '
            c = c.replace('"',"'")
            jsonStr = jsonStr[:s] + c + jsonStr[e:]
        return json.loads(jsonStr)

This code also works for the JSON string in the question.


OR you can also do this:

import json, re
def fixJSON(jsonStr):
    # First replace every structural " with a backtick placeholder.
    jsonStr = re.sub(r'\\', '', jsonStr)
    jsonStr = re.sub(r'{"', '{`', jsonStr)
    jsonStr = re.sub(r'"}', '`}', jsonStr)
    jsonStr = re.sub(r'":"', '`:`', jsonStr)
    jsonStr = re.sub(r'":', '`:', jsonStr)
    jsonStr = re.sub(r'","', '`,`', jsonStr)
    jsonStr = re.sub(r'",', '`,', jsonStr)
    jsonStr = re.sub(r',"', ',`', jsonStr)
    # Plain '[' and ']' in the replacement: '\[' would insert a backslash
    jsonStr = re.sub(r'\["', '[`', jsonStr)
    jsonStr = re.sub(r'"\]', '`]', jsonStr)

    # Replace all the remaining, unwanted " with a space
    jsonStr = re.sub(r'"',' ', jsonStr)

    # Put back all the " where they are supposed to be.
    jsonStr = re.sub(r'\`','\"', jsonStr)

    return json.loads(jsonStr)
theBuzzyCoder
  • 1
    Nice, but instead of removing all the `"` from the text, why not replace them with some placeholder char (not otherwise found in the string) and then replace them back with the escaped quote after the proper `"` have been put back in? – tobias_k Oct 17 '15 at 09:43
  • It would be a proper way to do it. – theBuzzyCoder Nov 12 '15 at 03:49
3

I made a JSON fixer to solve problems like this.

It's a Python package (2.7) with a half-finished command-line tool.

See https://github.com/half-pie/half-json

import json
from half_json.core import JSONFixer
f = JSONFixer(max_try=100)
new_s = s.replace('\n', '')
result = f.fix(new_s)
d = json.loads(result.line)
# {u'name': u'ATB', u'modification_time': u'2013-08-09 20:04:36 07', u'city_id': u'3237585002430511', u'see_also': [{u'hash': u'5698hn745A8IJ1H86177uvgn94521J3464he26763737242Cf6e654G62J0I7878e', u'ads': {u'warning': None, u'sponsored_article': {u'ADVANCE': u',                    ', u'text': u'Business.English.', u'title': u'Center '}}, u'lon': 38.973110606808, u'lat': 45.029031222211, u'id': u'3237491513434577', u'name': u'Advance'}], u'response_code': u'200', u'lon': u'38.969916127827', u'firm_group': {u'count': u'1', u'id': u'3237499103085728'}, u'create_time': u'2008-07-22 10:02:04 07', u'city_name': u'Krasnodar', u'address': u'Turgeneva,   172/1', u'lat': u'45.069889625267', u'id': u'3237490513229753', u'api_version': u'1.3', u'page_url': None}

There is also a test case at https://github.com/half-pie/half-json/blob/master/tests/test_cases.py#L76-L80

    line = '{"title": "Center "ADVANCE"", "text": "Business.English."}'
    ok, newline, _ = JSONFixer().fix(line)
    self.assertTrue(ok)
    self.assertEqual('{"title": "Center ","ADVANCE":", ","text": "Business.English."}', newline)
tink
  • Thank you for creating this package! I've used it in my project and found it to be incredibly helpful. Your efforts are greatly appreciated! – abdullah.cu Aug 18 '23 at 06:48
2

You need to escape double quotes in JSON strings, as follows:

"title": "Образовательный центр \"ADVANCE\"",

To fix it programmatically, the simplest way would be to modify your JSON parser so you have some context for the error, then attempt to repair it.

Jakub Muda
Michael Foukarakis
1
import re

def fix_json(jsonStr):
    # Remove whitespace around delimiters to make things easier below
    jsonStr = jsonStr.replace('" :','":').replace(': "',':"').replace('"\n','"').replace('" ,','",').replace(', "',',"')
    # First replace every structural " with a backtick placeholder.
    jsonStr = re.sub(r'\\"', '"', jsonStr)
    jsonStr = re.sub(r'{"', '{`', jsonStr)
    jsonStr = re.sub(r'"}', '`}', jsonStr)
    jsonStr = re.sub(r'":"', '`:`', jsonStr)
    jsonStr = re.sub(r'":\[', '`:[', jsonStr)
    jsonStr = re.sub(r'":\{', '`:{', jsonStr)
    jsonStr = re.sub(r'":([0-9]+)', '`:\\1', jsonStr)
    # A group, not a character class, so only the three literals match
    jsonStr = re.sub(r'":(null|true|false)', '`:\\1', jsonStr)
    jsonStr = re.sub(r'","', '`,`', jsonStr)
    jsonStr = re.sub(r'",\[', '`,[', jsonStr)
    jsonStr = re.sub(r'",\{', '`,{', jsonStr)
    jsonStr = re.sub(r',"', ',`', jsonStr)
    jsonStr = re.sub(r'\["', '[`', jsonStr)
    jsonStr = re.sub(r'"\]', '`]', jsonStr)
    # Backslash-escape all remaining double quotes (")
    jsonStr = re.sub(r'"','\\"', jsonStr)
    # Put back all the " where they are supposed to be.
    jsonStr = re.sub(r'\`','\"', jsonStr)
    return jsonStr

It's based on @theBuzzyCoder's code above. Thanks, mate, for the idea.

  • Works like a charm. Added ```jsonStr = jsonStr[1:-1] if jsonStr[0] == '"' and jsonStr[-1]=='"' else jsonStr``` at the beginning of a function to remove extra double quotes – Ilya Jul 05 '23 at 20:23
0

In the source of https://fix-json.com I found a solution, but it's very dirty and looks like a hack. Just adapt it to Python:

jsString.match(/:.*"(.*)"/gi).forEach(function(element){
   var filtered = element.replace(/(^:\s*"|"(,)?$)/gi, '').trim();
   jsString = jsString.replace(filtered, filtered.replace(/(\\*)\"/gi, "\\\""));
});
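
One possible Python adaptation of the snippet above, offered as a rough sketch rather than a faithful port of the site's code. The function name `escape_inner_quotes` is made up for illustration. Like the JavaScript, it assumes each key-value pair sits on its own line, since the greedy pattern works line by line.

```python
import json
import re


def escape_inner_quotes(text):
    """Escape double quotes found inside string values, one line at a time."""
    # finditer scans the original text, mirroring String.match with the g flag.
    for match in re.finditer(r':.*".*"', text):
        element = match.group(0)
        # Drop the leading ': "' and the trailing '"' (or '",').
        inner = re.sub(r'^:\s*"|",?$', '', element).strip()
        # Normalise any run of backslashes before a quote to a single escape.
        fixed = re.sub(r'\\*"', r'\\"', inner)
        # Like String.replace with a string pattern: first occurrence only.
        text = text.replace(inner, fixed, 1)
    return text


bad = '{\n    "title": "Center "ADVANCE"",\n    "text": "Business.English."\n}'
print(json.loads(escape_inner_quotes(bad))["title"])  # Center "ADVANCE"
```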
Gibolt
Frost
0

It's ugly and far from perfect, but it works for me:

import json
import re
from json import JSONDecodeError

def get_json_info(info_row: str, type) -> dict:
    # Note: the `type` argument is not used inside the function
    try:
        info = json.loads(info_row)
    except JSONDecodeError:
        # Fall back to a naive split on '","' between key-value pairs
        data = {}
        try:
            for s in info_row.split('","'):
                if not s:
                    continue
                key, val = s.split(":", maxsplit=1)
                key = key.strip().lstrip("{").strip('"')
                val: str = re.sub('"', '\\"', val.lstrip('"').strip('\"}'))
                data[key] = val
        except ValueError:
            print("ERROR:", info_row)
        info = data
    return info
madjardi
0

Fix #1

If you fetched it from some website, make sure you are still working with the unmodified string. In my case, I was calling .replace('\\"','"'), after which the data was no longer valid JSON. If you did something similar, undo it.
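
A small sketch of how that kind of un-escaping breaks otherwise valid JSON. The sample string is made up for illustration.

```python
import json

# Valid JSON: the inner quotes are escaped with backslashes.
good = '{"title": "Center \\"ADVANCE\\""}'
assert json.loads(good)["title"] == 'Center "ADVANCE"'

# Stripping the backslashes turns it into invalid JSON.
broken = good.replace('\\"', '"')
try:
    json.loads(broken)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

print(parsed_ok)  # False
```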

Fix #2

Try adding some character in all the places instead of the key name. It will be fine.

shekhar chander
0
from json import JSONDecoder

def extract_json_objects(text, decoder=JSONDecoder()):
    results = []
    pos = 0
    while True:
        match = text.find('{', pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            results.append(result)
            pos = match + index
        except ValueError:
            pos = match + 1
    return results


response = 'some text {"name": {"fname":"John","lname":"DEO"}, "age": 30, "details": {"city": "New York", "country": "USA"}} more text\n{"fname":"John","lname":"DEO"}'

print(extract_json_objects(response))

output:

[{'name': {'fname': 'John', 'lname': 'DEO'}, 'age': 30, 'details': {'city': 'New York', 'country': 'USA'}}, {'fname': 'John', 'lname': 'DEO'}]
Pankaj