The JSON standard defines a specific set of valid 2-character escape sequences: \\, \/, \", \b, \r, \n, \f and \t, plus one longer escape sequence for Unicode code points, \uhhhh (\u followed by 4 hex digits). Any other sequence of a backslash plus another character is invalid JSON.
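As a quick check of that list, the sketch below feeds each valid escape to Python's json module and then tries one escape that is not on the list. The helper name is_valid_json is made up for this illustration; it is not part of the json module.

```python
import json

# Raw JSON text that uses every two-character escape plus one \uhhhh escape.
valid = r'"\\ \/ \" \b \r \n \f \t \u00e9"'
decoded = json.loads(valid)


def is_valid_json(text):
    """Hypothetical helper: report whether json.loads() accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(repr(decoded))
print(is_valid_json(r'"\n"'))  # an escape from the list
print(is_valid_json(r'"\a"'))  # an escape that is not on the list
```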
If you have a JSON source that you can't get fixed at the origin, the only way out is to remove the invalid sequences yourself, as you did with str.replace(), even though that is a little fragile (it breaks when an even number of backslashes precedes the quote).
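To make that fragility concrete, here is a small sketch. The exact str.replace() call is an assumption, since it isn't shown here; naive_fix is a hypothetical stand-in that simply drops the backslash in front of a single quote.

```python
import json


def naive_fix(text):
    # Assumed stand-in for the str.replace() approach: turn \' into '
    return text.replace("\\'", "'")


# Invalid JSON: \' is not a legal escape, and the replace repairs it.
simple = r'"it\'s"'
print(json.loads(naive_fix(simple)))

# Valid JSON: an escaped backslash (\\) that happens to be followed by a quote.
tricky = json.dumps("C:\\'quoted'")
print(tricky)

# The replace strips the second backslash, leaving an invalid \' behind.
try:
    json.loads(naive_fix(tricky))
except json.JSONDecodeError as exc:
    print("broken by the fix:", exc)
```

The second input was valid before the replacement and invalid after it, because the backslash that got removed was the second half of an escaped backslash.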
You could use a regular expression too, where you remove any backslashes not used in a valid sequence:
fixed = re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'', inputstring)
This won't catch an invalid escape that directly follows an escaped backslash (an odd-count backslash sequence like \\\), but it will catch anything else:
>>> import re, json
>>> broken = r'"JSON string with escaped quote: \' and various other broken escapes: \a \& \$ and a newline!\n"'
>>> json.loads(broken)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 34 (char 33)
>>> json.loads(re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'', broken))
"JSON string with escaped quote: ' and various other broken escapes: a & $ and a newline!\n"