When I receive JSON from an OCR server, the encoding seems to be broken. The image contains some characters that are not encoded(?) properly: printed to the console, they show up as \uXXXX escape sequences.
For example, processing an image like this:
ends up with the output:
"some text \u0141\u00f3\u017a"
It's confusing because if I add some code like this:
mystr = mystr.replace(r'\u0141', '\u0141')
mystr = mystr.replace(r'\u00d3', '\u00d3')
mystr = mystr.replace(r'\u0142', '\u0142')
mystr = mystr.replace(r'\u017c', '\u017c')
mystr = mystr.replace(r'\u017a', '\u017a')
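A quick way to see what these calls actually do is to compare the two arguments. This is only a minimal sketch: the sample string is a hypothetical stand-in for the server response, assumed to contain literal backslash escapes rather than the real characters.

```python
# The two arguments of replace() look alike but are different strings.
escaped = r'\u0141'   # raw literal: six characters (backslash, u, 0, 1, 4, 1)
decoded = '\u0141'    # normal literal: one character, U+0141

# Hypothetical sample, assumed to hold literal escapes like the OCR output.
mystr = r'some text \u0141\u00f3\u017a'

# Swap the six-character escape for the single real character.
fixed = mystr.replace(escaped, decoded)

print(len(escaped), len(decoded))  # 6 1
print(fixed)
```

So under that assumption the replace is not really replace(x, x): the search string and the replacement differ.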
The output is ok:
"some text Ółźż"
What is more, if I try to replace them with a regex:
mystr = re.sub(r'(\\u[0-9|abcdef|ABCDEF]{4})', r'\g<1>', mystr)
The output remains "broken":
"some text \u0141\u00f3\u017a"
The OCR service converts an image to MathML / LaTeX prepared for use in Python. Full documentation can be found here. So, for example:
will produce the following raw output:
"\\(\\Delta=b^{2}-4 a c\\)"
Note that the quotes are included in the string; maybe this is relevant to the problem.
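The surrounding quotes suggest the raw output may be a JSON-encoded string that has not been decoded yet. Assuming that is the case (not confirmed by the OCR documentation), a minimal sketch with the standard json module would look like this. The variable names are made up for illustration, and the raw literals mirror the outputs shown above.

```python
import json

# Assumption: the server returns JSON string literals, quotes included.
raw_text = r'"some text \u0141\u00f3\u017a"'
raw_math = r'"\\(\\Delta=b^{2}-4 a c\\)"'

# json.loads strips the quotes and resolves \uXXXX and \\ escapes.
text = json.loads(raw_text)
math = json.loads(raw_math)

print(text)
print(math)
```

If the assumption holds, both the \uXXXX sequences and the doubled backslashes would be resolved in one step, with no per-character replace calls.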
Why are the characters not displayed properly in the first place, while after this silly mystr.replace(x, x) everything works just fine? Why does the first method work while re.sub fails? The code seems to be okay, and it works fine in another script. What am I missing?
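For comparison with the re.sub attempt, here is a minimal sketch. It assumes the string holds literal backslash escapes, and the sample value is hypothetical. A replacement of r'\g<1>' puts the matched text back unchanged, whereas a replacement function can turn the four hex digits into a character.

```python
import re

# Hypothetical sample, assumed to hold literal escapes like the OCR output.
mystr = r'some text \u0141\u00f3\u017a'

# Replacing each match with group 1 (the match itself) changes nothing.
unchanged = re.sub(r'(\\u[0-9a-fA-F]{4})', r'\g<1>', mystr)

# A function replacement converts the captured hex digits to a character.
pattern = re.compile(r'\\u([0-9a-fA-F]{4})')
decoded = pattern.sub(lambda m: chr(int(m.group(1), 16)), mystr)

print(unchanged == mystr)  # True
print(decoded)
```

The character class is written as [0-9a-fA-F] here, because inside square brackets a | is matched literally rather than acting as alternation.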