Dealing with mis-escaped characters in JSON

Question

I am reading a JSON file into Python which contains escaped single quotes (\'). This leads to all kinds of hiccups, as nicely discussed e.g. here. However, I could not find anything on how to address the issue. I just did a

newstring=originalstring.replace(r"\'", "'")

and things worked out. But this seems rather ugly. I could not really find much material on how to deal with this kind of thing (creating an exception, or something) in the json docs either.

Is there a good, clean procedure for such an issue?

Going back to the source is not possible, unfortunately.

Thanks for your help!

The `\'` character sequence is indeed invalid JSON, because there is no such escape sequence in JSON. How did you produce this output in the first place? — Martijn Pieters, Jun 07 '16 at 21:13
Hi Martijn, I did not produce it at all, just working with what was given to me. Apparently it was created with "Export to JSON plugin for PHPMyAdmin". — patrick, Jun 08 '16 at 13:31
Do consider filing a bug report to that project in that case. — Martijn Pieters, Jun 08 '16 at 13:34
Sure, I just thought it was probably a problem with user settings (I've never worked with PHPMyAdmin). This seems like a straightforward thing for a *json plugin* to get right? I'll look into it. — patrick, Jun 08 '16 at 13:42

score 4 · Accepted Answer · edited Oct 07 '21 at 08:57

The JSON standard defines a specific set of valid two-character escape sequences: \\, \/, \", \b, \r, \n, \f and \t, plus one longer escape sequence for Unicode code points, \uhhhh (\u plus 4 hex digits). Any other sequence of a backslash plus another character is invalid JSON.
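A quick way to see this list in action is to ask Python's json module which single-character escapes it accepts. This is only a small sketch; the helper name is_valid_escape and the list of candidate characters are made up for illustration.

```python
import json

def is_valid_escape(ch):
    """Return True if backslash + ch decodes inside a JSON string."""
    try:
        json.loads('"\\' + ch + '"')
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False
    return True

# The eight valid single-character escapes, plus a few invalid ones.
candidates = '"\\/bfnrt' + "'a&$"
accepted = [ch for ch in candidates if is_valid_escape(ch)]
print(accepted)

# The \uhhhh form decodes to the code point given by the four hex digits.
print(json.loads(r'"\u0041"'))
```

Only the first eight candidates should survive; the single quote, a, & and $ are all rejected.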

If you have a JSON source you can't fix otherwise, the only way out is to remove the invalid sequences, like you did with str.replace(), even if that is a little fragile (it'll break when the quote is preceded by an even number of backslashes, as in \\', where the quote is not escaped at all).

You could use a regular expression too, where you remove any backslashes not used in a valid sequence:

fixed = re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'', inputstring)

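This won't catch a stray backslash at the end of an odd-count backslash sequence like \\\ (the lookbehind skips every backslash that follows another backslash), but it will catch anything else: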

>>> import re, json
>>> broken = r'"JSON string with escaped quote: \' and various other broken escapes: \a \& \$ and a newline!\n"'
>>> json.loads(broken)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 34 (char 33)
>>> json.loads(re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'', broken))
"JSON string with escaped quote: ' and various other broken escapes: a & $ and a newline!\n"
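If the odd-count case matters for your data, one possible variation is to let the pattern consume valid escapes first and drop whatever backslash is left over. This is a sketch rather than a drop-in fix; the names ESCAPE_OR_STRAY and strip_stray_backslashes are hypothetical, and like the pattern above it is applied to the whole text, not just to string values.

```python
import json
import re

# First alternative: a valid escape, captured so it can be kept as-is.
# Second alternative: any other backslash, which gets dropped.
ESCAPE_OR_STRAY = re.compile(r'(\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4}))|\\')

def strip_stray_backslashes(text):
    """Remove backslashes that do not start a valid JSON escape."""
    return ESCAPE_OR_STRAY.sub(lambda m: m.group(1) or "", text)

# Three backslashes before the quote: an escaped backslash plus a stray one.
broken = r'"odd run: \\\' and stray: \a"'
print(json.loads(strip_stray_backslashes(broken)))
```

Because the regex engine scans left to right and a matched escape is consumed as a unit, an escaped backslash (\\) can never be mistaken for the start of the next escape, so no lookbehind is needed.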

score 2 · Answer 2 · answered Jun 07 '16 at 21:27

The right thing would be to fix whatever is creating the invalid JSON file. But if that's not possible, I guess the replace is needed. You should use a regular expression, though, so it doesn't replace \\' with \'. In that case the first backslash is escaping the second backslash, so the quote isn't escaped at all. A negative lookbehind will prevent this.

import re
newstring = re.sub(r"(?<!\\)\\'", "'", originalstring)
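To make the difference concrete, here is a small comparison on a made-up input that is already valid JSON and contains an escaped backslash followed by a plain quote. The variable names are just for illustration.

```python
import json
import re

# Valid JSON text whose decoded value is: a, backslash, quote, b.
valid = r'''"a\\'b"'''

naive = valid.replace(r"\'", "'")            # eats the second backslash
guarded = re.sub(r"(?<!\\)\\'", "'", valid)  # leaves the text untouched

print(naive)
print(guarded)
print(json.loads(guarded))
```

The plain replace turns valid input into text with an invalid \' escape, while the lookbehind version leaves it alone. Note that a quote preceded by three backslashes still slips past the lookbehind.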

score 1 · Answer 3 · answered Jun 07 '16 at 21:31

The solution is not bad. It seems ugly because the problem is ugly - you have corrupt data. It's certainly simple, elegant, and effective. It will only fail if the substring \\' (that's three characters, I'm not escaping anything) is present anywhere, and even then only if the number of consecutive backslashes is even. So your options are:

Just do your current thing, but first check if r"\\'" in originalstring and throw an error if so. Easy, safe, probably fine.
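Use a regex with a negative lookbehind for (\\\\)+ or something. (Python's re module only accepts fixed-width lookbehinds, so the pattern would need restructuring to handle runs of arbitrary length.)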
Catch errors and use the attributes of the errors to decide on a portion of the string to replace.
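The first option could look roughly like this. The function name and the error message are invented for the sketch.

```python
def unescape_single_quotes(text):
    """Replace \\' with ' but refuse input where that would be ambiguous."""
    # r"\\'" is the three-character substring: backslash, backslash, quote.
    if r"\\'" in text:
        raise ValueError("backslash run before a single quote; not safe to replace")
    return text.replace(r"\'", "'")

print(unescape_single_quotes(r'"it\'s fine"'))
```

This rejects some inputs that could be repaired (an odd number of backslashes before the quote), which is the price of keeping the check this simple.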

Check out this snippet:

import json
from json.decoder import JSONDecodeError

s = r'"\'"'
print(s)
try:
    print(json.loads(s))
except JSONDecodeError as e:
    print(vars(e))

Output:

"\'"
{'msg': 'Invalid \\escape', 'colno': 2, 'doc': '"\\\'"', 'pos': 1, 'lineno': 1}
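Building on those attributes, the third option might be sketched as a retry loop: parse, and on an invalid-escape error drop the offending backslash and try again. The name loads_lenient and the max_fixes limit are hypothetical, and the sketch assumes that the error message starts with "Invalid \" (as in the output above) and that pos points at the backslash or just after it. It is Python 3 only and re-parses after every fix, so it is slow on large inputs with many bad escapes.

```python
import json

def loads_lenient(text, max_fixes=1000):
    """Parse JSON, removing one invalid backslash per failed attempt."""
    for _ in range(max_fixes):
        try:
            return json.loads(text)
        except json.JSONDecodeError as e:
            # Assumption: escape errors are reported with this message prefix.
            if not e.msg.startswith("Invalid \\"):
                raise
            # Find the nearest backslash at or before the reported position.
            i = text.rfind("\\", 0, e.pos + 1)
            if i == -1:
                raise
            text = text[:i] + text[i + 1:]
    raise ValueError("gave up after %d fixes" % max_fixes)

print(loads_lenient(r'"\'"'))
```

Unlike the regex approaches, this only touches backslashes that the parser itself rejected, so valid escapes such as \\ are never altered.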

Only in Python 3 does the exception have those attributes; in Python 2 all you'll get is a `ValueError` exception, and you are stuck with parsing the text message for this info. — Martijn Pieters, Jun 07 '16 at 21:42
You can use `json.JSONDecodeError` by the way, you don't need to import it from `json.decoder`. — Martijn Pieters, Jun 07 '16 at 21:43
