-1

I am working on a simple Python script. Unfortunately, some of the data I must work with is stored as follows:

My data

trouble_string = '{\"N\": \"Centr\\u00e1lna nervov\\u00e1 s\\u00fastava\"}'

What I want to achieve

I want to convert the string to the following format:

decoded_string = '{"N": "Centrálna nervová sústava"}'

Problem

As you can see, the accented letters are numerically encoded. Is there a smart way to decode this string?

What I tried

bytes(s, encoding='utf-8').decode(encoding='utf-16')
# outputs: '䌢湥牴畜〰ㅥ湬\u2061敮癲癯畜〰ㅥ猠畜〰慦瑳癡≡'

bytes(s, encoding='utf-16').decode(encoding='utf-8')
# outputs: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
tripleee
Fusion
  • The assumption that anything here is UTF-16 is false. This is just regular backslash escaping. – tripleee Sep 25 '19 at 14:54
  • @tripleee Looks like utf-8 and utf-16 have different ways of escaping. Check out https://convertcodes.com/unicode-converter-encode-decode-utf/. – Fusion Sep 25 '19 at 15:27
  • No, the page you are linking to is confused. Neither UTF-8 nor UTF-16 has any backslash escaping mechanism. The notation `\u1234` is simply Python's way of representing a Unicode character (and at this point it has no encoding at all, neither UTF-8 nor UTF-16). JSON is by definition UTF-8 but uses precisely this representation, too, so my vote is on deceze's answer. – tripleee Sep 25 '19 at 15:29
  • Maybe see also https://stackoverflow.com/questions/32499846/is-utf-16-compatible-with-utf-8 – tripleee Sep 26 '19 at 04:11
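
The point made in the comments above can be checked directly in an interpreter. This is only a minimal sketch; the variable names `parsed` and `literal` are illustrative and do not come from the thread.

```python
# An escape in a normal string literal is resolved by the Python parser:
# the resulting str holds a single code point, with no encoding attached.
parsed = '\u00e1'

# In a raw literal the backslash is kept, so the str holds six characters.
literal = r'\u00e1'

print(parsed == 'á', len(parsed))   # True 1
print(len(literal))                 # 6

# An encoding only comes into play when the str is converted to bytes.
print(parsed.encode('utf-8'))       # b'\xc3\xa1'
print(parsed.encode('utf-16-le'))   # b'\xe1\x00'
```

The same character yields different bytes under UTF-8 and UTF-16, but neither encoding involves a backslash.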

2 Answers

3

It looks like JSON, so decode it and then encode it in whatever way you prefer?

>>> import json
>>> json.loads('{\"N\": \"Centr\\u00e1lna nervov\\u00e1 s\\u00fastava\"}')
{'N': 'Centrálna nervová sústava'}
>>> json.dumps(json.loads('{\"N\": \"Centr\\u00e1lna nervov\\u00e1 s\\u00fastava\"}'), ensure_ascii=False)
'{"N": "Centrálna nervová sústava"}'
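
The same round trip could be wrapped in a small helper for use in a script. This is a sketch, assuming the input is always valid JSON; the function name `decode_escaped_json` is hypothetical, and a raw string literal is used so that the backslashes in the data are unambiguous.

```python
import json


def decode_escaped_json(text: str) -> str:
    """Parse JSON text and re-serialize it, keeping non-ASCII characters as-is."""
    data = json.loads(text)                       # resolves the \uXXXX escapes
    return json.dumps(data, ensure_ascii=False)   # do not re-escape on output


# Raw literal: the string really contains the characters \u00e1 and so on.
trouble_string = r'{"N": "Centr\u00e1lna nervov\u00e1 s\u00fastava"}'
decoded_string = decode_escaped_json(trouble_string)
print(decoded_string)  # {"N": "Centrálna nervová sústava"}
```

If the data is needed rather than the re-serialized text, the `dict` returned by `json.loads` can be used directly.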
deceze
-1
trouble_string = '{\"N\": \"Centr\\u00e1lna nervov\\u00e1 s\\u00fastava\"}'
result = trouble_string.encode().decode("unicode-escape")

Quote from docs:

unicode_escape - Produce a string that is suitable as Unicode literal in Python source code.
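
One caveat worth checking before relying on this approach: the `unicode_escape` codec decodes bytes as Latin-1, so any non-ASCII character that is already present unescaped in the input gets mangled after a UTF-8 `encode()`. The sketch below uses a made-up input (`mixed`, not from the question) to compare it with `json.loads`.

```python
import json

# Hypothetical input: one character stored unescaped ("é"),
# followed by one stored as an escape sequence (\u00e1).
mixed = r'"é\u00e1"'

# The two UTF-8 bytes of "é" are each decoded as a separate Latin-1 character.
via_codec = mixed.encode('utf-8').decode('unicode_escape')

# The JSON parser handles both the unescaped and the escaped character.
via_json = json.loads(mixed)

print(via_codec)  # "Ã©á"
print(via_json)   # éá
```

For the pure-ASCII string in the question, both approaches produce the same text.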

Olvin Roght
  • Your `trouble_string` doesn't contain all the backslashes you put there, though; and the code you posted doesn't actually produce Unicode literals in the result. – tripleee Sep 25 '19 at 15:04
  • @tripleee, I copied it from the question, so it contains exactly what it should contain. – Olvin Roght Sep 25 '19 at 15:10