How to convert string containing unicode escape \u#### to utf-8 string

Question

I have been trying to get this working since this morning.

My sample.txt

choice = \u9078\u629e

Code:

with open('sample.txt', encoding='utf-8') as f:
    for line in f:
        print(line)
        print("選択" in line)
        print(line.encode('utf-8').decode('utf-8'))
        print(line.encode().decode('utf-8'))
        print(line.encode('utf-8').decode())
        print(line.encode().decode('unicode-escape').encode("latin-1").decode('utf-8')) # as suggested.

Output:
choice = \u9078\u629e
False
choice = \u9078\u629e
choice = \u9078\u629e
choice = \u9078\u629e
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 9-10: ordinal not in range(256)

When I do this in ipython qtconsole:

In [29]: "choice = \u9078\u629e"
Out[29]: 'choice = 選択'

So the question is: how can I read a text file containing a unicode-escaped string like \u9078\u629e (I don't know exactly what it's called) and convert it to utf-8 like 選択?

note: `utf-8` is an encoding. It maps `unicode strings` to `byte-strings` so that systems can communicate correctly, regardless of how each one stores unicode strings in memory. The in-memory representation of unicode is implementation-defined. `選択` is a unicode string, in memory. It is not encoded, so it's not `utf-8`. Encoding as `utf-8` only comes into play when you write to a file or send data over the network. This means that `.encode("latin-1").decode('utf-8')` makes no sense: it would be like saving an image in JPG format as "output.jpg", then loading "output.jpg" as an MP3 audio file. – ninMonkey, Sep 26 '19 at 16:20

Thierry Lathuille · Accepted Answer · 2018-03-16T08:38:55.887

4

If you read it from a file, just give the encoding when opening:

with open('test.txt', encoding='unicode-escape') as f:
    a = f.read()
print(a)

# choice = 選択

with test.txt containing:

choice = \u9078\u629e
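To try this end to end without creating a file by hand, here is a minimal sketch. It writes a stand-in for test.txt into a temporary directory (the directory and the `content` variable are assumptions for illustration), then reads it back line by line with `encoding='unicode-escape'` and repeats the membership check from the question.

```python
import os
import tempfile

# Hypothetical stand-in for test.txt: plain ASCII text holding literal \uXXXX escapes.
content = "choice = \\u9078\\u629e\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "w", encoding="ascii") as f:
        f.write(content)

    # The escapes are decoded while the file is read.
    with open(path, encoding="unicode-escape") as f:
        lines = [line.rstrip("\n") for line in f]

print(lines[0])            # choice = 選択
print("選択" in lines[0])  # True
```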

If you already have the text in a string, you can convert it like this:

a = "choice = \\u9078\\u629e"
a.encode().decode('unicode-escape')
# 'choice = 選択'
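One caveat worth knowing: the `unicode-escape` codec treats every byte that is not part of an escape as Latin-1, so input that mixes literal non-ASCII characters with `\uXXXX` escapes is likely to come out garbled. A possible alternative for that case, sketched below, is to replace only the escape sequences with a regular expression. The helper name `decode_u_escapes` is made up for this example, and the sketch deliberately ignores `\UXXXXXXXX`, surrogate pairs, and escaped backslashes.

```python
import re

# A literal backslash, the letter 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text):
    """Replace literal \\uXXXX sequences in text with the characters they name.

    Hypothetical helper: everything that is not a \\uXXXX escape is left untouched.
    """
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


mixed = "選択 = \\u9078\\u629e"   # literal characters plus escapes
result = decode_u_escapes(mixed)
print(result)  # 選択 = 選択
```

Because the function works on `str` and never round-trips through bytes, no encoding has to be chosen at this step.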

edited Mar 16 '18 at 08:38

answered Mar 16 '18 at 08:18

Thierry Lathuille

Please test it by saving it to a text file and reading it back; that's where the problem arises. – Rahul Mar 16 '18 at 08:18
In a text file, it gives `UnicodeEncodeError: 'latin-1' codec can't encode characters in position 9-10: ordinal not in range(256)` – Rahul Mar 16 '18 at 08:21
`a = "choice = \u9078\u629e";print(a)` even this gives `choice = 選択`. The problem is how to do it from a text file. – Rahul Mar 16 '18 at 08:23
@Rahul I updated the answer for reading directly from a file. – Thierry Lathuille Mar 16 '18 at 08:30
@Aran-Fey: Oops.... you're right, I made a big mixup... Removed it, thanks! – Thierry Lathuille Mar 16 '18 at 08:38
The `latin-1` error comes up because you're using python 2 with no encoding in `line.encode()`, so your system is doing "convert unicode to latin-1" (or often ascii), which fails because latin-1 cannot represent that value. The unicode strings `'\u9078\u629e' == '選択'` are exactly equal. There's no need for any conversion in python 3. Python 2 is tricking you by silently erroring and implicitly en/decoding even when you're not asking for it. This box is too short to explain why, but if you use python 2, you get problems when not using `from __future__ import unicode_literals` – ninMonkey Sep 26 '19 at 16:28
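
As a side check on where that error originates, the failing expression from the question can be replayed step by step in Python 3. This is only a sketch using the question's own string. It suggests the `UnicodeEncodeError` is raised by the explicit `.encode("latin-1")` call, after `unicode-escape` has already produced `選択`.

```python
line = "choice = \\u9078\\u629e"   # what reading sample.txt as utf-8 yields

# Step 1: the escapes are resolved here; nothing more is needed.
decoded = line.encode().decode("unicode-escape")
print(decoded == "choice = 選択")  # True

# Step 2: the extra latin-1 round trip fails, since 選 and 択 are above U+00FF.
try:
    decoded.encode("latin-1")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = (exc.start, exc.end)   # end is exclusive

print(failed_at)  # (9, 11), i.e. "position 9-10" in the traceback
```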
