22

In Python 2.7 at least, unicodedata.name() doesn't recognise certain characters.

>>> from unicodedata import name
>>> name(u'\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> name(u'a')
'LATIN SMALL LETTER A'

Certainly Unicode contains the character \n, and it has a name, specifically "LINE FEED".

NB. unicodedata.lookup('LINE FEED') and unicodedata.lookup(u'LINE FEED') both give a KeyError: undefined character name.

smci
Hammerite

1 Answer

19

The unicodedata.name() lookup relies on column 2 of the UnicodeData.txt database in the standard (Python 2.7 uses Unicode 5.2.0).

If that name starts with < it is ignored. All control codes, including newlines, are in that category; their name column contains nothing but the placeholder <control>:

000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

Column 11 (counting from one, so that the name is column 2) is the old Unicode 1.0 name, and should not be used, according to the standard. In other words, \n has no name other than the generic <control>, which the Python database ignores (as it is not unique).
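To make the column layout concrete, here is a small sketch that splits the sample record shown above into its semicolon-separated fields. The variable names are only illustrative, and the indices in the code count from zero.

```python
# The record for U+000A, copied from UnicodeData.txt as quoted above.
record = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"
fields = record.split(";")

# Counting from zero: 0 = code point, 1 = name,
# 2 = general category, 10 = Unicode 1.0 name.
code_point = int(fields[0], 16)
name = fields[1]
category = fields[2]
unicode_1_name = fields[10]

# Names wrapped in angle brackets are placeholders shared by many
# code points, so they cannot serve as a unique character name.
has_real_name = not name.startswith("<")

print(hex(code_point), name, category, unicode_1_name, has_real_name)
```

The only field that says LINE FEED is the legacy Unicode 1.0 one; the name field itself carries just the placeholder.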

Python 3.3 added support for NameAliases.txt, which lets you look up characters by alias; lookup('LINE FEED'), lookup('new line') and lookup('eol') all return \n. However, the unicodedata.name() function does not support aliases, nor could it (which one would it pick?):

  • Added support for Unicode name aliases and named sequences. Both unicodedata.lookup() and '\N{...}' now resolve name aliases, and unicodedata.lookup() resolves named sequences too.
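A minimal sketch of that behaviour, assuming Python 3.3 or later. The alias spellings used here are ones listed for U+000A in NameAliases.txt; the dictionary and variable names are only for illustration.

```python
import unicodedata

# Each alias for U+000A resolves through lookup() on Python 3.3+.
aliases = ("LINE FEED", "NEW LINE", "EOL")
resolved = {alias: unicodedata.lookup(alias) for alias in aliases}
print(resolved)

# The \N{...} escape in string literals accepts the same aliases.
escaped = "\N{LINE FEED}"

# name() still has no name to return for a control character.
try:
    unicodedata.name("\n")
    name_failed = False
except ValueError:
    name_failed = True
print(name_failed)
```

So the mapping only works in one direction: from alias to character, not from character back to a name.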

TL;DR: LINE FEED is not the official name for \n, it is but an alias for it. Python 3.3 and up let you look up characters by alias.
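For code that has to name arbitrary characters, one option is to pass a default to unicodedata.name() rather than catching the ValueError. The sketch below does that; the describe() helper and its fallback format are made up for this example and are not part of the unicodedata module.

```python
import unicodedata

def describe(char):
    """Return the character's name, or a placeholder if it has none."""
    # name() returns the second argument instead of raising ValueError
    # when the code point has no name in the bundled database.
    name = unicodedata.name(char, None)
    if name is not None:
        return name
    # Hypothetical fallback: general category plus the code point.
    return "<{} U+{:04X}>".format(unicodedata.category(char), ord(char))

print(describe("a"))   # LATIN SMALL LETTER A
print(describe("\n"))  # <Cc U+000A>
```

The same fallback applies to any code point without a name in the Unicode version bundled with the interpreter, not only to control characters.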

Martijn Pieters
  • 1
    :) ... Here is a screenshot of using pure py3.8, py3.9 interpreters, nothing installed, with the result I mentioned. https://imgur.com/I9bYX1r – Mazyod Sep 07 '21 at 12:11
  • I just realized that saying "we have this issue showing" could be read as "the exact same issue with the same characters". I did mention a specific string, though: different characters seem to have this issue in later versions. I guess that as more characters are defined over time, one has to guard against them in code. – Mazyod Sep 07 '21 at 12:16
  • 4
    @Mazyod: I misunderstood what you meant by your error message, I thought you meant you got an attribute error for the function `name()` on the module. You are getting a different issue: Python 3.8 and Python 3.9 bundle different Unicode standards, 12.1.0 and 13.0.0, respectively. See the [`unicodedata.unidata_version` attribute](https://docs.python.org/3/library/unicodedata.html#unicodedata.unidata_version). `SMILING FACE WITH TEAR` is [new in Unicode 13](https://unicode.org/emoji/charts-13.0/emoji-released.html). – Martijn Pieters Sep 07 '21 at 12:51
  • The characters 0000-001F and 007F-009F are Control chars, and unicodedata.name() lookup throws a ValueError on them. – smci May 24 '23 at 02:40
  • @smci: yes, because all codepoints with a name that starts with `<` are ignored, no name is found and that results in a `ValueError`. – Martijn Pieters Jun 08 '23 at 18:17