I'm working with International Phonetic Alphabet (IPA) symbols in my Python program, a rather strange set of characters whose UTF-8 encodings are anywhere from 1 to 3 bytes long. This thread from several years ago basically asked the reverse question, and it seems that ord(character) returns a decimal number that I could convert to hex and then to a code point. However, the input for ord() seems to be limited to one byte. If I try ord() on any non-ASCII character, like ɨ for example, it outputs:

TypeError: ord() expected a character, but string of length 2 found

With that no longer an option, is there any way in Python 2.7 to find the Unicode code point of a given character? (And does that character then have to be a unicode type?) I don't mean by just manually looking it up on a Unicode table, either.

Arcaeca

3 Answers

With that no longer an option, is there any way in Python 2.7 to find the Unicode code point of a given character? (And does that character then have to be a unicode type?) I don't mean by just manually looking it up on a Unicode table, either.

You can only find the unicode code point of a unicode object. To convert your byte string to a unicode object, decode it using mystr.decode(encoding), where encoding is the encoding of your string. (You know the encoding of your string, right? It's probably UTF-8. :-) Then you can use ord according to the instructions you already found.

>>> ord(b"ɨ".decode('utf-8'))
616

As an aside, from your question it sounds like you're working with the strings in their UTF-8 encoded bytes form. That's probably going to be a pain. You should decode the strings to unicode objects as soon as you get them, and only encode them if you need to output them somewhere.
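A minimal sketch of that decode-first approach, written so it should run on both Python 2.7 and Python 3. The helper name `code_point` is made up for illustration, and it assumes the input is the UTF-8 encoding of exactly one character in the BMP (which covers the IPA letters):

```python
def code_point(raw_bytes, encoding='utf-8'):
    """Return the code point of one encoded character (hypothetical helper)."""
    text = raw_bytes.decode(encoding)  # bytes -> unicode object
    if len(text) != 1:
        raise ValueError('expected exactly one character, got %d' % len(text))
    return ord(text)


raw = b'\xc9\xa8'        # the two UTF-8 bytes of U+0268 (barred i)
cp = code_point(raw)
label = 'U+%04X' % cp    # conventional U+XXXX notation
print('%d %s' % (cp, label))  # 616 U+0268
```

Passing only the first byte (`b'\xc9'`) raises `UnicodeDecodeError`, because one byte of a multi-byte sequence cannot be decoded on its own.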

BrenBarn
  • Hi BrenBarn, thanks for the reply, but it's still raising the same `TypeError`. I've tried everything I know of to make sure that the character is in UTF-8 to begin with so it can be decoded. Does this method still work when the character is being read in from a file, not hardcoded like that? I have a test script [here](http://pastebin.com/jgbAWkPT) and the input file [here](http://pastebin.com/w74jkhhL). Sorry if this isn't the right place to ask for clarification; I'm new here at SO – Arcaeca Aug 12 '16 at 16:21
  • @Arcaeca: Your error is probably because you are reading in bytes from the file, grabbing a single byte, and then trying to decode it. But if you grab one byte of a multi-byte UTF-8 sequence, the decoding will fail. As I said in my answer, you shouldn't attempt to encode and decode individual characters. Decode *the entire file* right when you read it in, for instance by using [`io.open`](https://docs.python.org/2/library/io.html#io.open) instead of the built-in open function. – BrenBarn Aug 12 '16 at 17:14
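The advice in that last comment could look roughly like the sketch below. The function name `code_points_in_file` and the temporary demo file are illustrative assumptions, not taken from the linked pastebin script; the only library calls used are `io.open`, `tempfile.mkstemp`, and `os.close`/`os.remove`. It assumes the file is UTF-8 and that all characters are in the BMP.

```python
import io
import os
import tempfile


def code_points_in_file(path, encoding='utf-8'):
    """Decode the whole file once, then pair each character with its code point."""
    with io.open(path, encoding=encoding) as f:
        text = f.read()  # already a unicode object, not raw bytes
    return [(ch, ord(ch)) for ch in text if not ch.isspace()]


# Demo with a throwaway file containing U+0268 and U+0259.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-8') as f:
    f.write(u'\u0268 \u0259\n')

result = code_points_in_file(demo_path)
os.remove(demo_path)

for ch, cp in result:
    print('U+%04X' % cp)  # U+0268, then U+0259
```

Because the file is decoded as a whole, iterating over `text` yields complete characters and never a fragment of a multi-byte sequence.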

This is actually a bug in Python 2, depending on how it was built, for unicode characters outside the BMP (> 0xFFFF); see: https://bugs.python.org/issue8670#msg105656

For example this works:

>>> ord(u'\uffff')
65535
>>> len(u'\uffff')
1

But this does not:

>>> ord(u'\U00010000')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

And even more surprisingly:

>>> len(u'\U00010000')
2

This is because there used to be "narrow" builds of Python versus "wide" builds. In "narrow" builds, unicode strings are stored internally as 16-bit code units (UCS-2/UTF-16), which uses less memory but requires two code units (a "surrogate pair") to represent each character above U+FFFF. In "wide" builds, UCS-4 is used internally for unicode strings and you won't have this problem.

Since Python 3.3 (PEP 393) the narrow/wide distinction is gone and this is no longer a problem. On Python 2, the easiest way to check which build you have is sys.maxunicode, which is 0xFFFF on narrow builds and 0x10FFFF on wide builds.

This answer demonstrates how to extract the ordinal from surrogate pairs in narrow builds.
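For reference, a rough sketch of the surrogate-pair arithmetic, using the standard UTF-16 formula. This is not the code from the linked answer; the name `code_point` is hypothetical. It should run on Python 2.7 (narrow or wide) and on Python 3:

```python
import sys

# True only on narrow Python 2 builds; Python 3.3+ always reports 0x10FFFF.
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def code_point(ch):
    """Like ord(), but also accepts a two-unit UTF-16 surrogate pair."""
    if len(ch) == 2:
        hi, lo = ord(ch[0]), ord(ch[1])
        if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
            # Each surrogate carries 10 bits of the value above U+FFFF.
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    return ord(ch)  # single characters (and invalid input) go to ord()


pair = u'\ud800\udc00'  # the surrogate pair that encodes U+10000
print('U+%X' % code_point(pair))  # U+10000
```

On a narrow build, `u'\U00010000'` is stored as exactly this pair, so `code_point(u'\U00010000')` returns 65536 where plain `ord()` raises `TypeError`.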

Iguananaut

>>> u'ɨ'
u'\u0268'
>>> u'i'
u'i'
>>> 'ɨ'.decode('utf-8')
u'\u0268'
Nehal J Wani