I've created a dictionary in Python, but I'm having problems with extended ASCII codes.

The loop that creates the dictionary is (ASCII numbers 128 to 164: é, à, etc.):

# extended ASCII codes
i = 128
while i <= 165 :
    dictionnary[chr(i)] = 'extended ascii'
    i = i + 1

But when I try to use the dictionary:

>>> dictionnary['è']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '\xc3\xa8'

I have # -*- coding: utf-8 -*- in the header of the Python script. I've tried encode, decode, etc., but the result is always wrong.

To understand what is happening, I tried:

>>> ord('é')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

and

>>> ord(u'é')
233

I'm confused by ord(u'é'), because 'é' is number 130 in the extended ASCII table, not 233.

I understand that extended ASCII codes contain "two characters", but I don't understand how to solve the problem with the dictionary.

Thanks in advance ! :-)

Karl Knechtel
lilawood
  • There is no such thing as "extended ASCII". There are a lot of encodings (cpXXXX in Windows, latinXX, iso-8859-XX and others in the real world) where 247 can mean different things. – glglgl Jan 21 '14 at 09:37
  • Extended ASCII is the characters in the range 128 and above. ASCII = 0-127, extended ASCII = 128-255. This dates back to the 1960s and 1970s. Now it is not important except for its residual effects, like when you can't print characters above 128 but you can print those below 128. It dates back to dumb terminals. – M T Head Aug 01 '17 at 00:24
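
To make the first comment's point concrete, here is a minimal sketch that decodes the same byte value under several encodings. The byte 130 is taken from the question; the choice of codecs is only illustrative, and nothing in the question confirms which table was actually meant.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: one byte value, several interpretations.
raw = bytearray([130])  # the single byte 0x82

decoded = {}
for codec in ('cp437', 'cp850', 'cp1252', 'latin-1'):
    # The same byte maps to a different Unicode character per encoding.
    decoded[codec] = raw.decode(codec)

for codec in sorted(decoded):
    print('%s -> U+%04X' % (codec, ord(decoded[codec])))
```

Under cp437 and cp850 the byte is é (U+00E9), under cp1252 it is a low quotation mark (U+201A), and under latin-1 it is a control character (U+0082).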

1 Answer

Use unichr instead of chr. In Python 2, chr produces a string containing a single byte, whereas unichr produces a string containing a single Unicode character. Also do lookups with Unicode strings: d[u'é'], because d['é'] will look up the UTF-8 encoding of é (the two bytes '\xc3\xa8' for è in the question's traceback).
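
A caveat worth sketching: unichr(130) is the Unicode code point U+0082, a control character, not é (which is U+00E9, i.e. 233). The sketch below therefore assumes that the "extended ASCII" table in the question is code page 437, where 130 is é; that is an assumption, not something the question confirms. The name CODEC is hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Assumption: the table in the question is code page 437 (130 -> é).
CODEC = 'cp437'

dictionnary = {}
for i in range(128, 166):  # 128..165 inclusive, same as the question's loop
    # Decode one raw byte into the Unicode character it stands for.
    char = bytearray([i]).decode(CODEC)
    dictionnary[char] = 'extended ascii'

# Look up with a Unicode string; u'\xe8' is è.
print(dictionnary[u'\xe8'])
```

With the keys stored as Unicode characters, lookups such as dictionnary[u'è'] no longer depend on how the source file or terminal happens to encode the literal.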

You have three things in your code: a latin-1 encoded str, a UTF-8 encoded str, and a unicode string. Keeping track of which one you have at any point requires knowing how Python handles strings and a decent understanding of Unicode and encodings.
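
As a rough illustration of those three kinds of string, here is a small sketch using é. The variable names are made up for the example, and the code runs on Python 2.7 and Python 3 (where the byte strings are the bytes type).

```python
# -*- coding: utf-8 -*-
text = u'\xe9'                      # Unicode string: one character, é (U+00E9)
as_utf8 = text.encode('utf-8')      # UTF-8 bytes: two bytes, 0xC3 0xA9
as_latin1 = text.encode('latin-1')  # latin-1 bytes: one byte, 0xE9 (233)

print(len(text))       # number of characters in the Unicode string
print(len(as_utf8))    # number of bytes in the UTF-8 form
print(len(as_latin1))  # number of bytes in the latin-1 form

# Decoding with the matching codec gets the same Unicode string back.
print(as_utf8.decode('utf-8') == as_latin1.decode('latin-1'))
```

This is why ord('é') complained about a string of length 2 in the question: the literal was a UTF-8 byte string, while ord(u'é') saw a single Unicode character with code point 233.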

No answer about encodings and Unicode is complete without a link to Joel Spolsky's article on the matter: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)