It has been suggested that this question is a duplicate of 6269765. I did not use any b'' literals in either the original code or the minimal code. See the embedded edits below:
I have reduced a problem I have been having today down to this minimal Python 3 code:
x='\xc3\xb3'
print(''.join([hex(ord(c))[2:] for c in x]))
print(''.join([hex(c)[2:] for c in x.encode()]))
When I run this code I get:
c3b3
c383c2b3
Is str.encode() really supposed to change the UTF-8 character ó (LATIN SMALL LETTER O WITH ACUTE) into the two characters ó (LATIN CAPITAL LETTER A WITH TILDE and SUPERSCRIPT THREE)?
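To check what the literal actually holds, here is a small sketch using the standard unicodedata module. The variable names are only for illustration. It prints the code point and Unicode name of each character in the str, and then the bytes that str.encode() produces with its default UTF-8 encoding.

```python
import unicodedata

x = '\xc3\xb3'  # same literal as in the minimal code above

# One entry per character in the str: its official Unicode name.
names = [unicodedata.name(c) for c in x]
for c, name in zip(x, names):
    print('U+{:04X} {}'.format(ord(c), name))

# The bytes produced by str.encode() (UTF-8 is the default encoding).
encoded = x.encode()
print(encoded.hex())
```

If this reports two separate characters rather than one, then the str literal is not the single ó I assumed it was.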
edit:
No ó was entered, contrary to what one commenter suggested. Only ó was used: it was typed in directly in some cases and read in from a text file in others. The text file was part of the original problem. It is the system dictionary file of the current version of Ubuntu. The file is dated 23 Oct 2011, and its filesystem path appears in the command examples of the original question.
The original problem involved encountering the word Asunción at line 1053 of that file. The ó character in Asunción has the byte sequence C3B3, which the FileFormat.Info UTF-8 lookup table describes as LATIN SMALL LETTER O WITH ACUTE (described here for readers unable to properly read Unicode text).
No b'' literals were used in any code, neither the original, nor the minimal.
The problem turned out to be UTF-8 characters being changed: the ó character became ó, which means the bytes c3b3 became c383c2b3. The dictionary file literally contains the two bytes c3b3, which display as ó as expected and as described in that UTF-8 table. The original problem was an exception being raised due to the change in length.
I used str.encode() to try to solve the problem and to discover its source. I believe that something, somewhere, did something similar to str.encode().
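To illustrate the kind of conversion I suspect, here is a minimal sketch. It assumes (I have not confirmed this) that the two bytes from the file are interpreted one byte per character, as Latin-1 would do, and that the result is then encoded as UTF-8. The variable names are only for illustration.

```python
raw = bytes.fromhex('c3b3')  # the two bytes as stored in the dictionary file

# Assumption: somewhere the bytes are treated as Latin-1 (one character
# per byte) instead of UTF-8, and the result is later encoded as UTF-8.
as_latin1 = raw.decode('latin-1')
round_trip = as_latin1.encode('utf-8')
print(len(as_latin1), round_trip.hex())

# For comparison: decoding as UTF-8 first gives a single character, and
# encoding it again returns the original two bytes.
as_utf8 = raw.decode('utf-8')
print(len(as_utf8), as_utf8.encode('utf-8').hex())
```

The first path turns two bytes into four, which matches the change in length I saw. The second path leaves the bytes unchanged.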
The minimal code to show this problem at first was:
x='Asunción'
print(' '.join([hex(ord(c))[2:] for c in x]))
print(' '.join([hex(c)[2:] for c in x.encode()]))
but I found that many people were unable to see the lowercase acute o, so I changed it to hexadecimal codes (\x), which had the same hexadecimal verification output, both before and after the str.encode(), as the first minimal example just above with the literal full word Asunción.
Then I decided it would be more minimal to use the affected character alone and no spaces in the hexadecimal output.
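For readers who cannot see or type the character, another way to write the word is with a Unicode escape. This is only a sketch and assumes that '\u00f3' (or the equivalent \N{...} name form) denotes the single character I mean. The names x and y are just for illustration.

```python
# Assumption: '\u00f3' is the intended single character ó.
x = 'Asunci\u00f3n'
y = 'Asunci\N{LATIN SMALL LETTER O WITH ACUTE}n'  # same word, spelled by name

print(x == y)
print(' '.join([hex(ord(c))[2:] for c in x]))
print(' '.join([hex(c)[2:] for c in x.encode()]))

# Number of characters in the str versus number of bytes after encoding.
print(len(x), len(x.encode()))
```

Written this way, the str has 8 characters and its UTF-8 encoding has 9 bytes, with ó as the only character that takes two bytes.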
end of edit, back to original post:
This UTF-8 character was encountered in the American English dictionary file, /usr/share/dict/american-english, on the latest American English edition of Ubuntu. You can see the first word in that file with this sequence with the command:
head -1053 /usr/share/dict/american-english|tail -1
You can see it in hexadecimal with the command:
head -1053 /usr/share/dict/american-english|tail -1|od -Ad -tx1
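The same check can be done from Python by reading the file in binary mode, so that no decoding happens before the bytes are inspected. This is a rough sketch: nth_line_bytes is a hypothetical helper name of my own, and the final part only runs if the dictionary file exists at that path.

```python
import os
from itertools import islice

def nth_line_bytes(path, n):
    """Return line n (1-based) of the file as raw bytes, newline stripped."""
    with open(path, 'rb') as f:
        line = next(islice(f, n - 1, n), None)
    if line is None:
        raise ValueError('file has fewer than {} lines'.format(n))
    return line.rstrip(b'\r\n')

DICT_PATH = '/usr/share/dict/american-english'
if os.path.exists(DICT_PATH):
    raw = nth_line_bytes(DICT_PATH, 1053)
    # Hexadecimal view of the raw bytes, then the text decoded as UTF-8.
    print(raw.hex())
    print(raw.decode('utf-8'))
```

Opening the file with 'rb' returns bytes objects, so the hexadecimal output shows exactly what is stored on disk.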
Character descriptions were obtained from here. I am running Python 3.5.2 compiled on GCC 5.4.0 on Ubuntu 16.04.1 LTS updated 2 days ago.
edit:
Is the correct answer here to avoid bytes totally and not use str.encode()? Or is there a better answer?