I don't see why you have to convert from UTF-8. From the Unicode docs:
UTF-8 uses the following rules:
If the code point is < 128, it’s represented by the corresponding byte value.
If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
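A quick sketch checking those two rules (the sample characters are my own choice):

```python
# Code points below 128 encode to a single, identical byte.
assert 'A'.encode('utf-8') == b'A'              # U+0041 -> one byte

# Code points at or above 128 become a multi-byte sequence,
# where every byte of the sequence is in the range 128..255.
e = 'é'.encode('utf-8')                         # U+00E9 -> two bytes
assert e == b'\xc3\xa9'
assert all(b >= 128 for b in e)
```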
You can convert it to ASCII, for instance:
u.encode('utf-8') = b"\xea\x80\x80abcd\xde\xb4 u'\\u2019'=\xe2\x80\x99"
u.encode('ascii', 'ignore') = b"abcd u'\\u2019'="
u.encode('ascii', 'replace') = b"?abcd? u'\\u2019'=?"
u.encode('ascii', 'xmlcharrefreplace') = b"ꀀabcd޴ u'\\u2019'=’"
u.encode('ascii', 'backslashreplace') = b"\\ua000abcd\\u07b4 u'\\u2019'=\\u2019"
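As a runnable sketch, `u` can be reconstructed from the escapes visible in the `backslashreplace` output above (the exact string is my assumption inferred from that output):

```python
# Assumed reconstruction of u from the escapes shown above.
u = '\ua000' + 'abcd' + '\u07b4' + " u'\\u2019'=" + '\u2019'

# Each error handler deals with the non-ASCII code points differently:
print(u.encode('utf-8'))                       # multi-byte UTF-8 sequences
print(u.encode('ascii', 'ignore'))             # non-ASCII dropped
print(u.encode('ascii', 'replace'))            # non-ASCII -> '?'
print(u.encode('ascii', 'xmlcharrefreplace'))  # non-ASCII -> &#NNNN; references
print(u.encode('ascii', 'backslashreplace'))   # non-ASCII -> \uNNNN escapes
```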
From the re docs:
Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
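A minimal demonstration of that rule:

```python
import re

# A str pattern cannot be used on bytes (and vice versa): TypeError.
try:
    re.search('abc', b'abc')
except TypeError as exc:
    print(exc)

# Matching like with like works fine.
assert re.search(b'abc', b'xabcy') is not None
assert re.search('abc', 'xabcy') is not None
```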
re.A
re.ASCII
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching.
This is only meaningful for Unicode patterns, and is ignored for byte patterns.
Note that for backward compatibility, the re.U flag still exists
(as well as its synonym re.UNICODE and its embedded counterpart (?u)),
but these are redundant in Python 3 since matches are Unicode by default for strings
(and Unicode matching isn’t allowed for bytes).
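For example, a small sketch of the difference re.ASCII makes for \w (the sample string is my own):

```python
import re

s = 'héllo'
# By default, \w matches the full range of Unicode word characters.
assert re.findall(r'\w+', s) == ['héllo']
# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so 'é' splits the word.
assert re.findall(r'\w+', s, re.ASCII) == ['h', 'llo']
```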
Tested with Python 3.4.2.