Why doesn't unicodedata recognise certain characters?

Question

In Python 2.7 at least, unicodedata.name() doesn't recognise certain characters.

>>> from unicodedata import name
>>> name(u'\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> name(u'a')
'LATIN SMALL LETTER A'

Certainly Unicode contains the character \n, and it has a name, specifically "LINE FEED".

NB. unicodedata.lookup('LINE FEED') and unicodedata.lookup(u'LINE FEED') both give a KeyError: undefined character name.

On my machine, using Python 3.4, `name('\n')` fails, but `'\N{LINE FEED}'` works, as does `lookup('LINE FEED')`. On Python 2 all of them fail. – Bakuriu, Jul 03 '14 at 11:54
@AaronDigulla: no, `\n` has no name (other than `<control>`). `LINE FEED` is an *alias* instead. – Martijn Pieters, Jul 03 '14 at 12:39

Martijn Pieters · Accepted Answer · 2014-07-03T12:48:18.563

The unicodedata.name() lookup relies on column 2 of the UnicodeData.txt database in the standard (Python 2.7 uses Unicode 5.2.0).

If that name starts with < it is ignored. All control codes, including newlines, are in that category; their name column holds nothing other than the generic <control>:

000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

Column 11 of that record (field 10 in the standard's zero-based numbering) is the old Unicode 1.0 name, and should not be used, according to the standard. In other words, \n has no name, other than the generic <control>, which the Python database ignores (as it is not unique).
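To make the column layout concrete, here is a minimal sketch that splits the sample record quoted above and applies the "names starting with < are ignored" rule. It only mimics the behaviour described in this answer; it is not how CPython builds its database, and the variable names are my own.

```python
# Sample record quoted above from UnicodeData.txt (semicolon-separated).
record = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"

fields = record.split(";")
codepoint = int(fields[0], 16)  # 1st column: code point, in hex
name = fields[1]                # 2nd column: character name
category = fields[2]            # 3rd column: general category
old_name = fields[10]           # 11th column: Unicode 1.0 name (not to be used)

# Mimic the rule described above: a name in angle brackets is a generic
# placeholder rather than a unique name, so treat it as "no name".
usable_name = None if name.startswith("<") else name

print(hex(codepoint), repr(name), category, repr(old_name), usable_name)
```

The placeholder in the name column is why name() has nothing to return for this code point.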

Python 3.3 added support for NameAliases.txt, which lets you look up names by alias, so lookup('LINE FEED'), lookup('new line') or lookup('eol'), etc., all reference \n. However, the unicodedata.name() method does not support aliases, nor could it (which one would it pick?). From the Python 3.3 release notes:

Added support for Unicode name aliases and named sequences. Both unicodedata.lookup() and '\N{...}' now resolve name aliases, and unicodedata.lookup() resolves named sequences too.

TL;DR: LINE FEED is not the official name for \n, it is but an alias for it. Python 3.3 and up let you look up characters by alias.
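As a quick illustration for Python 3.3 and up, the sketch below shows name() failing for \n, the optional default argument of name() that avoids the exception, and alias resolution through lookup() and \N{...}. The "<unnamed>" fallback string is just a placeholder I chose for the example.

```python
import unicodedata

newline = "\n"

# name() raises ValueError for characters that have no unique name...
try:
    unicodedata.name(newline)
    has_name = True
except ValueError:
    has_name = False

# ...unless a default is passed as the second argument.
fallback = unicodedata.name(newline, "<unnamed>")

# Python 3.3+: lookup() and \N{...} both resolve name aliases.
via_lookup = unicodedata.lookup("LINE FEED")
via_escape = "\N{LINE FEED}"

print(has_name, fallback, via_lookup == newline, via_escape == newline)
```

On Python 2.7 the lookup() call fails with a KeyError, as noted in the question.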

edited Jul 03 '14 at 12:48

answered Jul 03 '14 at 12:06

Martijn Pieters

:) ... Here is a screenshot of using pure py3.8, py3.9 interpreters, nothing installed, with the result I mentioned. https://imgur.com/I9bYX1r – Mazyod Sep 07 '21 at 12:11
I just realized, perhaps by saying "we have this issue showing", it would be understood as "the exact same issue with the same characters". Although I do mention a certain string, as in different characters seem to have this issue in later version. I guess it's just as more characters are defined over time, one has to guard against them in code. – Mazyod Sep 07 '21 at 12:16
@Mazyod: I misunderstood what you meant by your error message, I thought you meant you got an attribute error for the function `name()` on the module. You are getting a different issue: Python 3.8 and Python 3.9 bundle different Unicode standards, 12.1.0 and 13.0.0, respectively. See the [`unicodedata.unidata_version` attribute](https://docs.python.org/3/library/unicodedata.html#unicodedata.unidata_version). `SMILING FACE WITH TEAR` is [new in Unicode 13](https://unicode.org/emoji/charts-13.0/emoji-released.html). – Martijn Pieters Sep 07 '21 at 12:51
The characters 0000-001F and 007F-009F are Control chars, and unicodedata.name() lookup throws a ValueError on them. – smci May 24 '23 at 02:40
@smci: yes, because all codepoints with a name that starts with `<` are ignored, no name is found and that results in a `ValueError`. – Martijn Pieters Jun 08 '23 at 18:17
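For anyone who needs to guard against this in code, here is a rough sketch based on the comments above. It checks the two control-character ranges mentioned and uses the default argument of name() so nothing raises. `safe_name` and the "<unnamed>" default are hypothetical names for illustration, not part of unicodedata.

```python
import unicodedata

# Ranges from the comment above: U+0000-U+001F and U+007F-U+009F.
control_points = list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))

def safe_name(ch, default="<unnamed>"):
    """Hypothetical helper: return the character name, or a default instead of raising."""
    return unicodedata.name(ch, default)

# Code points in those ranges for which no name is found.
unnamed = [cp for cp in control_points if safe_name(chr(cp), "") == ""]
# Check that every one of them has general category Cc (control).
all_cc = all(unicodedata.category(chr(cp)) == "Cc" for cp in control_points)

# The bundled Unicode version decides which newer characters have names.
print(unicodedata.unidata_version)
print(len(unnamed), len(control_points), all_cc)
```

The same default-argument approach also covers characters that are newer than the Unicode version bundled with the interpreter, since those have no name in its database either.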
