How to make Python treat literal string as UTF-8 encoded string

Question

I have some strings in Python loaded from a file. They look like lists, but are actually strings, for example:

example_string = '["hello", "there", "w\\u00e5rld"]'

I can easily convert it into an actual list of strings:

from typing import List

def string_to_list(string_list: str) -> List[str]:
    # Strip quotes and brackets, then split on commas; escapes stay unresolved
    converted = string_list.replace('"', '').replace('[', '').replace(']', '').split(',')
    return [s.strip() for s in converted]

as_list = string_to_list(example_string)
print(as_list)

This prints the following list of strings:

['hello', 'there', 'w\\u00e5rld']

The problem is the encoding of the last element. It looks like this when I run print(as_list), but if I run

for element in as_list:
    print(element)

it prints

hello
there
w\u00e5rld

I don't know what happens to the first backslash; it seems to me like it is there to escape the second one. How do I make Python just resolve the escaped character and print "wårld"? The problem is that the element is already a str, not bytes, so as_list[2].decode("UTF-8") does not work.

I tried using string.decode(), and I tried plain printing.

@Tomerikoo: Amusingly, they found their own nonsense way to do all the stuff that question is asking for, but yeah, that question would have solved their entire problem from the get-go. — ShadowRanger, Mar 26 '23 at 13:30
@ShadowRanger I was about to just suggest it as a side note, but after realizing that `ast` even solves the problem I think it can be used as a duplicate... I'm also pretty sure there's a duplicate somewhere regarding your comment about the `codecs` module — Tomerikoo, Mar 26 '23 at 13:32
Found it. Assuming they just have a string - `"w\\u00e5rld".encode('latin-1', 'backslashreplace').decode('unicode-escape')` will work — Tomerikoo, Mar 26 '23 at 13:41
@Tomerikoo: More directly (but requiring the import), you'd just do `codecs.decode("w\\u00e5rld", 'unicode-escape')`. You have to use the `codecs` module for the single-step, text-to-text approach, because you can't call `.decode` on a `str`. — ShadowRanger, Mar 26 '23 at 15:10

score 1 · Accepted Answer · answered Mar 26 '23 at 13:27

The correct way to decode that to a list of strings is not the insane set of string operations you're performing. It's just ast.literal_eval(example_string), which will handle Unicode escapes just fine:

    import ast
    
    example_string = '["hello", "there", "w\\u00e5rld"]'
    example_list = ast.literal_eval(example_string)
    for word in example_list:
        print(word)

which, assuming you have appropriate font support for the character, outputs:

hello
there
wårld

If you absolutely needed to just fix Unicode escapes, the codecs module can be used for unicode_escape decoding, but in this case, you have a legal Python literal in a string, and ast.literal_eval can do all the work.
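For the case where only the escapes need fixing, here is a minimal sketch of the codecs route. It assumes the list has already been split the way the question's string_to_list does it, and that every element is ASCII-only text containing literal backslash escapes. The helper name unescape is hypothetical and used only for illustration.

```python
import codecs


def unescape(text: str) -> str:
    """Resolve literal backslash escapes such as \\u00e5 in an ASCII-only string."""
    # codecs.decode is used because str has no .decode method in Python 3.
    return codecs.decode(text, "unicode_escape")


# The list as produced by the question's string_to_list: escapes still unresolved.
as_list = ["hello", "there", "w\\u00e5rld"]
fixed = [unescape(s) for s in as_list]

for word in fixed:
    print(word)
```

One caveat: if an element already contains real non-ASCII characters alongside the escapes, this single-step call can mangle them. The `.encode('latin-1', 'backslashreplace').decode('unicode-escape')` variant from the comments on the question is the safer choice for such mixed input.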

answered Mar 26 '23 at 13:27

ShadowRanger


In this case, the string was likely generated with `json.dumps` with `ensure_ascii=True`. `json.loads(example_string)` -> `['hello', 'there', 'wårld']`. `ast.literal_eval` is normally needed when the strings are single-quoted and non-JSON. – Mark Tolonen Mar 26 '23 at 15:03
@MarkTolonen: True, either one will work. `json.loads` is faster IIRC, assuming the input is guaranteed JSON, which is more limited than Python literals. – ShadowRanger Mar 26 '23 at 15:08
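To illustrate the two comments above, here is a small sketch comparing the parsers on the question's string and on a single-quoted variant. The python_only string is a made-up example, not data from the question. The guess that the input came from json.dumps remains a guess; the code only shows that json.dumps would produce the same text.

```python
import ast
import json

example_string = '["hello", "there", "w\\u00e5rld"]'

# Both parsers accept this input and resolve the \u00e5 escape.
from_json = json.loads(example_string)
from_ast = ast.literal_eval(example_string)
print(from_json == from_ast)

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# which reproduces the original text exactly.
print(json.dumps(from_json) == example_string)

# Hypothetical single-quoted input: a valid Python literal, but not valid JSON.
python_only = "['hello', 'w\\u00e5rld']"
print(ast.literal_eval(python_only))
try:
    json.loads(python_only)
except json.JSONDecodeError as exc:
    print("not valid JSON:", exc)
```

If the file is known to contain JSON, json.loads is the narrower fit; ast.literal_eval covers the case where the text is a Python literal that JSON would reject.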
