9

I'm having trouble assigning unicode strings as names for a namedtuple. This works:

a = collections.namedtuple("test", "value")

and this doesn't:

b = collections.namedtuple("βαδιζόντων", "value")

I get the error

Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib64/python3.4/collections/__init__.py", line 370, in namedtuple
        result = namespace[typename]
KeyError: 'βαδιζόντων'

Why is that the case? The documentation says, "Python 3 also supports using Unicode characters in identifiers," and the key is valid unicode?

Nemo
Thomas
  • Something I noticed: It works fine if I leave out the ``ó``. Seems like a bug to me. – smheidrich May 28 '15 at 10:19
  • Interesting - I should have tested that myself. ó is the only character from the unicode "Greek Extended" block, so this might be relevant. But it would still disagree with what the documentation says. – Thomas May 28 '15 at 10:24
  • Upon closer inspection, what happens is that, for some reason, ``'ó'`` is ``'\xe1\xbd\xb9'`` in the UTF-8 encoded source file, but turns into ``'\xcf\x8c'`` in the code generated by ``namedtuple`` to generate its class. This definitely seems like a bug. – smheidrich May 28 '15 at 10:38
  • Could you try my suggestion and see whether it works for you? – knitti May 28 '15 at 11:45

3 Answers

6

The problem is specifically with the letter ό (U+1F79 Greek small letter omicron with oxia). This is a ‘compatibility character’: Unicode would rather you use ό instead (U+03CC Greek small letter omicron with tonos). U+1F79 only exists in Unicode in order to round-trip to old character sets that distinguished between oxia and tonos, a distinction that later turned out to be incorrect.

When you use compatibility characters in an identifier, Python's source code parser automatically normalises them to form NFKC, so your class name ends up with U+03CC in it.
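The mapping itself is easy to check with the standard `unicodedata` module. This is only a minimal sketch; the variable names are made up for illustration:

```python
import unicodedata

oxia = "\u1f79"   # GREEK SMALL LETTER OMICRON WITH OXIA
tonos = "\u03cc"  # GREEK SMALL LETTER OMICRON WITH TONOS

# NFKC is the normalization form Python applies to identifiers in source code.
normalized = unicodedata.normalize("NFKC", oxia)

print(unicodedata.name(oxia))
print(unicodedata.name(normalized))
print(normalized == tonos)  # True: the oxia form is replaced by the tonos form
```

The two characters usually render identically, which is why printing their Unicode names is more reliable than looking at them.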

Unfortunately collections.namedtuple doesn't know about this. The way it creates the new class is by inserting the given name into a bunch of Python code in a string, then executing it (yuck, right?), and extracting the class from the resulting namespace dict using its name... the original name, not the normalised version Python has actually compiled, so the lookup fails with a KeyError.

This is a bug in collections which may be worth filing, but for now you should use the canonical character U+03CC ό.

bobince
  • Arrg, now I understand! I've been bitten by these compatibility characters for Greek accented letters a number of times. At least that allows me to work around the problem. Thank you for your explanation! – Thomas May 28 '15 at 10:41
  • A reference to source code will be useful https://hg.python.org/cpython/file/661cdbd617b8/Lib/collections/__init__.py#l332 – Mazdak May 28 '15 at 10:44
2

That ó is U+1F79 ɢʀᴇᴇᴋ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴏᴍɪᴄʀᴏɴ ᴡɪᴛʜ ᴏxɪᴀ. Python identifiers are normalized as NFKC, and U+1F79 in NFKC becomes U+03CC ɢʀᴇᴇᴋ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴏᴍɪᴄʀᴏɴ ᴡɪᴛʜ ᴛᴏɴᴏs.

Interestingly, if you use the same string with U+1F79 replaced by U+03CC, it works.

>>> b = collections.namedtuple("βαδιζ\u03CCντων", "value")
>>>

The documentation for namedtuple claims that "Any valid Python identifier may be used for a fieldname". Both strings are valid Python identifiers, as can be easily tested in the interpreter.

>>> βαδιζόντων = 0
>>> βαδιζόντων = 0
>>>

This is definitely a bug in the implementation. I traced it to this bit in the implementation of namedtuple:

namespace = dict(__name__='namedtuple_%s' % typename)
exec(class_definition, namespace)
result = namespace[typename] # here!

I guess that the typename left in the namespace dictionary by exec'ing the class_definition template, being a Python identifier, will be in NFKC form, and thus no longer match the actual value of the typename variable used to retrieve it. I believe simply pre-normalizing typename should fix this, but I haven't tested it.
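The mismatch can be reproduced without namedtuple at all. The following is a simplified stand-in for what the template code does, not the actual library source; the one-line class definition and the variable names are assumptions made for illustration:

```python
import unicodedata

# "βαδιζόντων" spelled with U+1F79 (omicron with oxia), written with escapes
# so the exact code points are unambiguous.
typename = "\u03b2\u03b1\u03b4\u03b9\u03b6\u1f79\u03bd\u03c4\u03c9\u03bd"

# Hypothetical, heavily reduced version of the class template.
class_definition = "class {}(tuple):\n    pass\n".format(typename)

namespace = {}
exec(class_definition, namespace)

# The parser normalized the class name to NFKC, so the key stored in the
# namespace is the normalized string, not the one that was passed in.
normalized = unicodedata.normalize("NFKC", typename)
found_raw = typename in namespace
found_normalized = normalized in namespace
print(found_raw, found_normalized)  # False True
```

Looking the class up under `normalized` instead of `typename` succeeds, which is consistent with the idea that pre-normalizing the name avoids the KeyError.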

R. Martinho Fernandes
1

Although there's already an accepted answer, let me offer a

Fix of the problem

# coding: utf-8
import collections
import unicodedata


def namedtuple_(typename, field_names, **kwargs):
    ''' just like collections.namedtuple(), but does unicode normalization
        on names
    '''

    if isinstance(field_names, str):
        field_names = field_names.replace(',', ' ').split()
    # normalize to NFKC, the form Python's parser uses for identifiers
    field_names = [
        unicodedata.normalize('NFKC', name) for name in field_names]
    typename = unicodedata.normalize('NFKC', typename)

    # forward keyword options (e.g. rename) instead of hard-coding them
    return collections.namedtuple(typename, field_names, **kwargs)


βαδιζόντων = namedtuple_('βαδιζόντων', 'value')

a = βαδιζόντων(1)

print(a)
# βαδιζόντων(value=1)
print(a.value == 1)
# True

What does it do?

This namedtuple_() implementation normalizes the names before handing them over to collections.namedtuple(), so the name used for the lookup matches the identifier Python actually compiles.

This is an elaboration on @R. Martinho Fernandes' idea of pre-normalizing the names.

knitti
  • Thank you, that is extremely helpful! I suspect it wouldn't solve my particular use case (which involves extracting a list of words from a text file and comparing it against a list of known words), but it's very good to have! – Thomas May 28 '15 at 12:53
  • It could help, depending on how and why you compare the words... you could strip combining characters from the NFKC form with a regexp and complete the whole normalization with a lower() – knitti May 28 '15 at 13:20
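
For the word-comparison case raised in the comments, a rough sketch of that idea could look like the following. It is a variation rather than exactly what was suggested: it decomposes with NFKD so that accents become separate combining characters, filters them with `unicodedata.combining()` instead of a regexp, and uses `casefold()` instead of `lower()`. The function name `comparison_key` is made up for this example.

```python
import unicodedata


def comparison_key(word):
    """Return an accent-insensitive, case-insensitive key for a word."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks (accents), keep everything else.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left and fold case for comparison.
    return unicodedata.normalize("NFKC", stripped).casefold()


with_oxia = "\u03b2\u03b1\u03b4\u03b9\u03b6\u1f79\u03bd\u03c4\u03c9\u03bd"
with_tonos = "\u03b2\u03b1\u03b4\u03b9\u03b6\u03cc\u03bd\u03c4\u03c9\u03bd"
print(comparison_key(with_oxia) == comparison_key(with_tonos))  # True
```

Note that stripping accents also makes words that differ only by accent compare equal, so whether this is acceptable depends on the word list.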