Remove utf-8 literals in a string python

Question

I'm new to Python. I have a string like:

s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'

I want to remove all the Unicode literals from the string, such as:

'\xc3\x82\xc2\xae'

I need output like:

'HDCF FTAE Greater China'

Can anyone help me with this?

Thank you

Python 2 and Python 3 differ in syntax for strings. An accurate answer needs to know if `s` is a Python 2 byte string or a Python 3 Unicode string. — Mark Tolonen, Aug 07 '18 at 00:55
It looks like your data is [mojibake](https://en.wikipedia.org/wiki/Mojibake). What you have is `'HDCF® FTAE® Greater China'` double-encoded as UTF-8. — Mark Tolonen, Aug 07 '18 at 01:02

score 4 · Accepted Answer · answered Aug 07 '18 at 01:14

On Python 2 (default string type is bytes):

>>> s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
>>> s.decode('ascii',errors='ignore').encode('ascii')
'HDCF FTAE Greater China'

On Python 3 (default string type is Unicode):

>>> s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
>>> s.encode('ascii',errors='ignore').decode('ascii')
'HDCF FTAE Greater China'

Note that the original string is mojibake. Ideally, fix how the string was read in the first place, but you can undo the damage with (Python 3):

>>> s.encode('latin1').decode('utf8').encode('latin1').decode('utf8')
'HDCF® FTAE® Greater China'

The original string was double-encoded as UTF-8. This works by converting the string directly 1:1 back to bytes¹, decoding as UTF-8, then converting directly back to bytes again and decoding with UTF-8 again.
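To make the chain of calls easier to follow, here is a minimal Python 3 sketch that splits it into named steps. The variable names (`step1`, `step2`, `step3`, `fixed`) are only illustrative and are not part of the original one-liner.

```python
# Python 3 sketch: peel off the two UTF-8 layers one step at a time.
s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'

step1 = s.encode('latin1')      # str -> bytes, one byte per code point
step2 = step1.decode('utf8')    # first UTF-8 layer removed
step3 = step2.encode('latin1')  # back to raw bytes, again 1:1
fixed = step3.decode('utf8')    # second UTF-8 layer removed

print(step2)  # HDCFÂ® FTAEÂ® Greater China
print(fixed)  # HDCF® FTAE® Greater China
```

The intermediate value `step2` is the singly-mangled form, which is why one round trip is not enough for this particular input.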

Here's the Python 2 version:

>>> s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
>>> print s.decode('utf8').encode('latin1').decode('utf8')
HDCF® FTAE® Greater China

¹This works because latin1 is a single-byte encoding whose 256 byte values map directly to the first 256 Unicode code points.
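The 1:1 mapping in the footnote can be checked directly. The following is a small Python 3 sketch (the names `all_bytes` and `text` are illustrative) showing that every byte value survives a latin1 round trip unchanged:

```python
# Python 3 sketch: latin1 maps byte value N to code point N and back.
all_bytes = bytes(range(256))       # every possible byte, 0x00..0xFF
text = all_bytes.decode('latin1')   # never fails: each byte has a code point

same_numbers = [ord(ch) for ch in text] == list(range(256))
round_trips = text.encode('latin1') == all_bytes

print(same_numbers, round_trips)  # True True
```

This is what makes latin1 usable as a lossless bridge between a mis-decoded `str` and the bytes it originally came from.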

score 3 · Answer 2 · Felk · 2018-08-06T14:40:33.767

If your goal is to limit the string to ASCII-compatible characters, you can encode it into ASCII and ignore unencodable characters, and then decode it again:

x = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
print(x.encode("ascii", "ignore").decode("utf-8"))

produces HDCF FTAE Greater China.

Check out str.encode() and bytes.decode()
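The second argument to `str.encode()` is the error handler. As a rough Python 3 sketch of how the common handlers behave on this input (the variable names are only illustrative):

```python
x = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'

# 'ignore' silently drops every character that ASCII cannot represent.
ignored = x.encode('ascii', 'ignore').decode('ascii')

# 'replace' substitutes '?' so you can see where characters were lost.
replaced = x.encode('ascii', 'replace').decode('ascii')

# The default handler, 'strict', raises instead of dropping anything.
try:
    x.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ignored)        # HDCF FTAE Greater China
print(replaced)       # HDCF???? FTAE???? Greater China
print(strict_failed)  # True
```

Since the encoded bytes are pure ASCII, decoding them with `'ascii'` or `'utf-8'` gives the same result.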

edited Aug 06 '18 at 14:40

answered Aug 06 '18 at 12:02

Felk

Thanks for the reply. It raises an error: `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)` – Narendra Lucky Aug 06 '18 at 12:10
Works on Python 3.6 with the input you supplied – Felk Aug 06 '18 at 12:16
Yeah indeed, I didn't test it with the indentation posted here. Fixed it by putting it on two separate lines – Felk Aug 06 '18 at 14:40

score 2 · Answer 3 · answered Aug 06 '18 at 12:18

You can filter the string against `string.printable`, a constant containing the printable ASCII characters, to keep only the characters that can be printed:

import string

s= 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'

printable = set(string.printable)
s = "".join(filter(lambda c: c in printable, s))
print(s)

Output:

HDCF FTAE Greater China
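To show what this filter keeps and drops, here is a small sketch wrapping it in a helper. The function name `keep_printable` is hypothetical, chosen only for this example. Note that `string.printable` covers ASCII only, so legitimate characters such as `®` are removed as well.

```python
import string

def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    allowed = set(string.printable)
    return "".join(ch for ch in text if ch in allowed)

print(len(string.printable))          # 100
print(keep_printable('HDCF® FTAE®'))  # HDCF FTAE
print(repr(keep_printable('a\tb\x00c')))  # 'a\tbc' (tab kept, NUL removed)
```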

Reference to this question.

score 0 · Answer 4 · answered Aug 06 '18 at 12:11

Maybe this helps:

s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
d = ''.join([i for i in s if ord(i) < 127])
print(d)
# OUTPUT as: HDCF FTAE Greater China
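A possible variant, assuming Python 3.7 or later, is to use `str.isascii()` instead of comparing `ord()` values by hand. This is a sketch of an alternative, not what the answer above does: `isascii()` accepts the full ASCII range (0 to 127), while `ord(i) < 127` also drops the DEL character (`\x7f`).

```python
s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'

# Keep only characters in the ASCII range (requires Python 3.7+).
d = ''.join(ch for ch in s if ch.isascii())
print(d)  # HDCF FTAE Greater China

# The two checks differ only for DEL (code point 127).
del_char = '\x7f'
print(del_char.isascii(), ord(del_char) < 127)  # True False
```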

utks009
