
When I receive JSON from an OCR server, the encoding seems to be broken. The image contains some characters that are not encoded(?) properly: displayed in the console, they are represented by \uXXXX.

For example, processing an image (not reproduced here) ends up with this output:

"some text \u0141\u00f3\u017a"

It's confusing because if I add some code like this:

mystr = mystr.replace(r'\u0141', '\u0141')
mystr = mystr.replace(r'\u00d3', '\u00d3')
mystr = mystr.replace(r'\u0142', '\u0142')
mystr = mystr.replace(r'\u017c', '\u017c')
mystr = mystr.replace(r'\u017a', '\u017a')

The output is ok:

"some text Ółźż"

What is more, if I try to replace them with a regex:

mystr = re.sub(r'(\\u[0-9|abcdef|ABCDEF]{4})', r'\g<1>', mystr)

The output remains "broken":

"some text \u0141\u00f3\u017a"

This OCR converts an image to MathML / LaTeX prepared for use in Python. Full documentation can be found here:

https://docs.mathpix.com/?python#process-an-image

So, for example, it will produce the following raw output:

"\\(\\Delta=b^{2}-4 a c\\)"

Note that the quotes are part of the string; maybe this is relevant to the problem.

  1. Why are the characters not displayed properly in the first place, while after this silly mystr.replace(x, x) everything works just fine?

  2. Why does the first method work while re.sub fails? The code seems to be okay and it works fine in another script. What am I missing?

Wiktor Stribiżew
Wincij
  • Do you process the incoming JSON with "json.load" (or "json.loads")? If so, can you show a sample of the JSON as properly formatted text in the question? – Michael Butscher Apr 24 '22 at 23:26
  • No I process it with `mystr = json.dumps(snip_request.json()['text'])` where snip_request is set by `requests.post(srv_adress, files={...} , data={...} , headers={...})` – Wincij Apr 24 '22 at 23:48
  • This means you receive the JSON with "json()" already as Python data structure and convert a part of it back to JSON with the "dumps" call. You shouldn't do the latter. – Michael Butscher Apr 25 '22 at 03:28

1 Answer


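Python 3 strings are Unicode by default, but your string contains the escape sequences as literal text rather than the characters they stand for, so the string you have is different from the string you expect to see in the output.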

>>> txt = r"some text \u0141\u00f3\u017a"
>>> txt
'some text \\u0141\\u00f3\\u017a'
>>> print(txt)
some text \u0141\u00f3\u017a

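The regex doesn't work because the string contains each \uXXXX as literal text (one backslash, a u, and four hex digits), and the replacement r'\g<1>' simply puts the matched text back, so nothing changes. In the replace calls, the second argument is a normal (non-raw) string literal, so Python converts \uXXXX into the actual character and inserts it, which is why that works. To reproduce: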

>>> txt[-5:]
'u017a'
>>> txt[-6:] 
'\\u017a'
>>> txt[-6:-5] 
'\\'
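The difference between the two approaches can be sketched as follows. This is only an illustration, assuming the string from the question; the callback-based re.sub at the end is an alternative not taken from the original code, and the character class is written as [0-9a-fA-F] for readability.

```python
import re

# Literal backslash-u sequences, as in the question (raw string keeps the backslashes).
txt = r"some text \u0141\u00f3\u017a"

# Replacing each match with itself (\g<1>) leaves the string unchanged.
same = re.sub(r'(\\u[0-9a-fA-F]{4})', r'\g<1>', txt)

# str.replace works because the second argument is a real one-character string:
# Python resolves '\u0141' when it parses the non-raw literal.
replaced = txt.replace(r'\u0141', '\u0141')

# Assumption: a callback can convert every escape at once by parsing
# the four hex digits and passing the code point to chr().
decoded = re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), txt)

print(same == txt)
print(replaced)
print(decoded)
```

The callback version only handles four-digit \uXXXX escapes; it does not combine surrogate pairs.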

What you should do to resolve it:

  • Make sure your response is received in the correct encoding and not as a raw string. (e.g. use response.text instead of response.body)
  • Otherwise
>>> txt.encode("raw-unicode-escape").decode('unicode-escape')
'some text Łóź'
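
The comments under the question suggest that the escapes come from calling json.dumps on a value that was already decoded by json(). A minimal sketch of that effect, assuming the decoded text is the string below (the real server response is not shown, so `text` is only a stand-in):

```python
import json

# Assumed stand-in for snip_request.json()['text']; the real response is not shown.
text = "some text \u0141\u00f3\u017a"

# json.dumps re-encodes the string as JSON: it adds surrounding quotes and,
# by default, escapes non-ASCII characters as \uXXXX.
dumped = json.dumps(text)

# ensure_ascii=False keeps the characters but still adds the quotes.
dumped_unicode = json.dumps(text, ensure_ascii=False)

# json.loads reverses json.dumps and recovers the original string.
restored = json.loads(dumped)

print(dumped)
print(dumped_unicode)
print(restored)
```

If this is what happens in the original script, using the decoded value directly, without json.dumps, would avoid both the \uXXXX escapes and the extra quotes.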
Lukas Schmid