
As per this code:

# coding=utf-8
import sys
import chardet

print(sys.getdefaultencoding())      # Python 2 prints 'ascii' here

a = 'abc'                            # Python 2: a plain '...' literal is a byte string (str)

print(type(a))                       # <type 'str'>
print(chardet.detect(a))             # chardet guesses the encoding of the raw bytes

b = a.decode('ascii')                # decode: bytes -> unicode

print(type(b))                       # <type 'unicode'>

c = '中文'                           # UTF-8 bytes, thanks to the coding declaration at the top

print(type(c))                       # <type 'str'>
print(chardet.detect(c))

m = b.encode('utf-8')                # encode: unicode -> bytes
print(type(m))                       # <type 'str'>
print(chardet.detect(m))

n = u'abc'                           # u'...' is a unicode literal

print(type(n))                       # <type 'unicode'>

x = n.encode('utf-8')                # positional argument: Python 2's encode() rejects keyword arguments

print(type(x))                       # <type 'str'>
print(chardet.detect(x))

I use UTF-8 to encode n, but chardet still reports the result as ASCII.

So I want to know: what is the relation between UTF-8, ASCII, and Unicode?

I run this with Python 2.


Cao Vison
  • In Python 3, all strings are Unicode by default. The `encode` method on `str` creates a `bytes` object in the specified encoding (see the sketch after these comments). – Charles Langlois Jan 24 '18 at 02:11
  • I got an error trying to run this code in Python 3. Python 3 handles text as Unicode by default. This is one of the main benefits of switching to Python 3: encoding issues are _mostly_ solved by default. See [this](https://stackoverflow.com/a/30885015/1215344) answer for more detail. – james-see Jan 24 '18 at 02:13
  • I think the issue is that UTF-8 encoding is identical to ASCII encoding for the first 128 characters, so `chardet` cannot differentiate the two for a string containing only ASCII characters? – Charles Langlois Jan 24 '18 at 02:16
  • Sorry, guys, I forgot to tell you that I use Python 2. – Cao Vison Jan 24 '18 at 02:59
  • There is no point in "detecting" the character encoding of text that you encoded yourself: it is whatever you made it. It's a bit like writing "Hallo" in German, asking someone what language it is, and being told Norwegian. The answers explain. – Tom Blodget Jan 24 '18 at 18:10
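
A minimal Python 3 sketch of the `str`/`bytes` split described in the first comment (interactive session shown for illustration):

>>> s = 'abc'                 # Python 3: str is Unicode text
>>> type(s)
<class 'str'>
>>> b = s.encode('utf-8')     # encode: str -> bytes
>>> type(b)
<class 'bytes'>
>>> b.decode('utf-8')         # decode: bytes -> str
'abc'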

3 Answers


UTF-8 is actually a variable-width encoding, and it just so happens that ASCII characters map directly to the same single bytes in UTF-8.

Since your UTF-8 string contains only ASCII characters, the string is, honestly, both an ASCII string and a UTF-8 string at the same time.

This visual might help:

>>> c = '中文abc中文'
>>> c
'中文abc中文'
>>> c.encode(encoding="UTF-8")
b'\xe4\xb8\xad\xe6\x96\x87abc\xe4\xb8\xad\xe6\x96\x87'

Notice how each character of "abc" in the UTF-8 output is only a single byte? Those are exactly the same bytes as their ASCII counterparts!
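
A one-line check of that claim (Python 3, so both results are bytes literals):

>>> 'abc'.encode('ascii') == 'abc'.encode('utf-8')
True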

Michael Guffre

UTF-8 encoding is such that characters 0-127 (Unicode code points U+0000 to U+007F) are the corresponding ASCII characters and are encoded the same way. `chardet.detect` therefore naturally reports a string containing only those characters as ASCII-encoded, since in effect it is...

The u'...' notation in Python 3 is there only for backwards compatibility, and it means the same thing as normal string notation. So u'abc' is the same as 'abc'.
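
A quick sketch of that ambiguity (Python 3, assuming `chardet` is installed; the exact confidence values will vary):

import chardet

# Only ASCII bytes: chardet has no evidence of anything beyond ASCII.
print(chardet.detect(b'abc'))
# e.g. {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

# Add a multi-byte UTF-8 sequence and the guess changes.
print(chardet.detect(u'中文abc'.encode('utf-8')))
# e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}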

Charles Langlois

It's because the designers of both Unicode and UTF-8 were brilliant and managed to achieve an impressive feat of backwards compatibility.

It started with the Latin-1 character set, which defined 256 characters, the first 128 of which were taken directly from ASCII. Each of these characters fits into a single byte.

Unicode built an expanded character set, and it started by stating that the first 256 codepoints would be the characters from Latin-1. This meant that the first 128 codepoints retained the same numeric value they had in ASCII.

Then came UTF-8, which used a variable-length encoding. Characters that take more than a single byte are marked by having the upper bit of each byte set, which means that any byte with its upper bit clear is a single-byte character. Since ASCII characters also have the upper bit clear, the encoding of those characters is identical between ASCII and UTF-8!
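
A small sketch of that byte-level compatibility (Python 3; the characters here are just illustrative examples):

# ASCII characters encode to the identical single byte in ASCII, Latin-1 and UTF-8.
for ch in 'abc':
    print(ch, hex(ord(ch)), ch.encode('ascii'), ch.encode('latin-1'), ch.encode('utf-8'))

# A non-ASCII Latin-1 character still fits in one byte in Latin-1,
# but UTF-8 spreads it over two bytes, each with its upper bit set.
print(u'é'.encode('latin-1'))   # b'\xe9'
print(u'é'.encode('utf-8'))     # b'\xc3\xa9' (0xC3 and 0xA9 both have the high bit set)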

Mark Ransom