With the stock Python 3.5-3.x regular expression engine, I have exhaustively tested that the regex

re.compile(r"[\x00-\x7F]", re.UNICODE)

matches all single characters with code points U+0000 through U+007F, and no others, and similarly, the regex

re.compile(r"[^\x00-\x7F]", re.UNICODE)

matches all single characters with code points U+0080 through U+10FFFF, and no others. However, what I do not know is whether this is guaranteed or just an accident. Have the Python maintainers made any kind of official statement about the meaning of range expressions in regex character classes in Unicode mode?
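For reference, the kind of exhaustive check described above can be sketched as follows. This is a minimal sketch, not the exact script used for the question: the helper name `check_all_code_points` is hypothetical, and it assumes a Python 3.4+ interpreter (for `fullmatch`). A clean run only shows what the interpreter it runs on does; it says nothing about whether the behavior is guaranteed.

```python
import re
import sys

ASCII_CLASS = re.compile(r"[\x00-\x7F]", re.UNICODE)
NON_ASCII_CLASS = re.compile(r"[^\x00-\x7F]", re.UNICODE)


def check_all_code_points():
    """Return every code point where either class disagrees with the expected split.

    Hypothetical helper: walks U+0000 through sys.maxunicode (lone surrogates
    included) and tests each one-character string against both patterns.
    """
    mismatches = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        expect_ascii = cp <= 0x7F
        got_ascii = ASCII_CLASS.fullmatch(ch) is not None
        got_non_ascii = NON_ASCII_CLASS.fullmatch(ch) is not None
        # Exactly one of the two classes should match, and it should be
        # the ASCII class precisely when cp is in U+0000..U+007F.
        if got_ascii != expect_ascii or got_non_ascii == expect_ascii:
            mismatches.append(cp)
    return mismatches


if __name__ == "__main__":
    bad = check_all_code_points()
    print("mismatching code points:", len(bad))
```

The loop covers about 1.1 million code points, so it takes a moment to run.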

The official re module documentation is fairly vague about the exact semantics of ranges, and in other regex implementations, e.g. POSIX BREs and EREs, the interaction between range expressions and characters outside the ASCII range is explicitly unspecified.

Sunny Patel
zwol
  • You do not need `re.UNICODE` in Python 3.x, it is default. It only affects shorthand character classes like `\s`, `\d`, `\w`, word boundaries `\b`. – Wiktor Stribiżew Jul 19 '18 at 14:08
  • @WiktorStribiżew Yes, I know. It is included for explicitness. – zwol Jul 19 '18 at 16:01
  • So, what is your question? How Python `re` differs from POSIX regex engine regarding the Unicode code point ranges? Please clarify what exactly should be included in the answer. Right now, it sounds as "Yes, Python `re` just works like that".. – Wiktor Stribiżew Jul 19 '18 at 16:02
  • @WiktorStribiżew The question is whether or not the observed behavior is guaranteed; in other words, whether application code can rely on it continuing to do this in the future. I don't know how I could make it any clearer. – zwol Jul 19 '18 at 16:14
  • Why should it be an accident? There are no characters beyond `U+10FFFF`. As long as this continues to be the upper boundary, the regex doesn't match more. And what's the problem here? The developer presumably had this assumption: *a character outside the `\x00-\x7f` range*. So it doesn't matter what character it is. It just shouldn't be in that range. – revo Jul 19 '18 at 16:22
  • So, you want someone, not necessarily a Python author, to promise that negated character classes like `[^\x00-\x7F]` will match chars from outside the BMP plane in all future versions of Python? It is like foreseeing the future. – Wiktor Stribiżew Jul 19 '18 at 16:22
  • @WiktorStribiżew Yes, it is exactly a promise of consistent future behavior, or an explicit refusal to make such a promise, that I am looking for. This is one of the major functions of standards and documentation for programming languages. It is not an abnormal thing at all in my experience. – zwol Jul 19 '18 at 16:28
  • @revo I do expect the artificial upper limit on Unicode to be removed eventually (at the present rate of allocation, we will run out of code points in about 600 years) but that's not the point. The point is that _some_ regex engines don't make any promises at all about what `[\x00-\x7f]` will or will not match, and in fact explicitly warn you not to rely on it to stay the same from version to version, and I want to know whether Python's is one of those. – zwol Jul 19 '18 at 16:30
  • Are we talking about `[\x00-\x7f]` or `[^\x00-\x7f]`? – revo Jul 19 '18 at 16:32
  • @revo We are talking about both, like it says in the question. – zwol Jul 19 '18 at 16:33
  • The first character range, `[\x00-\x7f]`, is explicit. It doesn't have anything to do with Unicode. It's a hexadecimal range. The second is explicit too. It's a negation. Again, it doesn't have anything to do with Unicode. It means *all* characters but `\x00-\x7f`. What *all* means has a known lower boundary, `\x80`. So I assume what made you worry is the upper boundary.... – revo Jul 19 '18 at 16:46
  • ... If that is right, then you shouldn't worry. It doesn't have anything to do with `[^\x00-\x7f]`. Engines don't translate it as *a character between `\u0080` and `\u10FFFF`*. They simply translate it as *a character not between `\x00` and `\x7f`*. – revo Jul 19 '18 at 16:46
  • @revo I just wanted to note that negated character classes do not necessarily work that way. In VIM, a negated character class never matches linebreaks (unless prepended with ``\_``), and in ECMAScript 5, negated character classes did not match chars outside the BMP plane. – Wiktor Stribiżew Jul 19 '18 at 17:10
  • @WiktorStribiżew VIM is a line editor; like sed and awk, it has a default behavior: *it sees everything in one line*. To change the behavior and go beyond one line, you have to say so explicitly. It's not about the regex, it's about the program. – revo Jul 19 '18 at 17:48
  • @WiktorStribiżew I did a test on IE 11 minutes ago; it doesn't match `` as a whole character but matches it as separate characters, which is expected and totally corresponds to a negated character class. – revo Jul 19 '18 at 17:50
  • @revo I know it looks explicit to you, but I swear to Ghod, some regex engine specifications _don't specify_ what either `[\x00-\x7f]` or `[^\x00-\x7f]` matches, leaving open the possibility that either of them could match _any_ random subset of Unicode, and that what they match could change from release to release, and that is the point of the question: is Python one of those engines? – zwol Jul 19 '18 at 18:45
  • @revo And to be clear, it doesn't matter what the things on either side of the range `-` are. I could have asked the exact same question about `[a-z]` or `[\u0080-\u00FF]` or `[ᄀ-ፚ]` and if we were talking about POSIX REs the answer would still be "it could match any random subset of Unicode". – zwol Jul 19 '18 at 18:51
  • Why would you think about a *random subset* when you are obviously saying `a` to `z`? – revo Jul 19 '18 at 19:04
  • @revo It is not at all obvious what characters are included in "from 'a' to 'z'." Spaniards will tell you that that set includes 'ñ', for instance. So POSIX throws up its hands and says that the characters matched by range expressions are unspecified (in anything other than the "C" locale, which limits you to ASCII). – zwol Jul 19 '18 at 19:07
  • [*Latin small letter*](https://unicode-table.com/en/007A/) `z` has a unique codepoint, regardless of a language alphabet. Other characters are no exception. Regular Expression engines are language agnostic. For language alphabets, engines may provide a syntax to provide access to [Unicode blocks or scripts](https://www.regular-expressions.info/unicode.html) (See sections at bottom). – revo Jul 19 '18 at 19:19
  • When you say `a` to `z`, in a programming sense, you literally mean `\u0061` to `\u007A` in Unicode table. – revo Jul 19 '18 at 19:22
  • @revo If you don't believe me that POSIX says what it says, read it for yourself, I provided a link. (Rule 7 for bracket expressions. Also please be aware that a "collating element" and a "character" are not the same thing.) – zwol Jul 19 '18 at 19:37
  • Engines aren't supposed to be POSIX-compliant. [POSIX.2 leaves some implementation specifics undefined... Perl regexes have become a de facto standard](https://en.wikipedia.org/wiki/Regular_expression). Python `re.I`: [...expressions like `[A-Z]` will match lowercase letters, too. This is not affected by the current locale.](https://docs.python.org/2/library/re.html#re.IGNORECASE). Python `re.L`: [Make `\w`, `\W`, `\b`, `\B`, `\s` and `\S` dependent on the current locale.](https://docs.python.org/2/library/re.html#re.LOCALE) – revo Jul 19 '18 at 19:58
  • @revo I can see that neither of us is going to convince the other, so can we please just drop this debate? I still need an answer to the question I actually asked, whether or not you think it makes sense to ask the question in the first place. – zwol Jul 19 '18 at 20:16
  • There is no debate here. Your last edit was saying *it's not helpful*, in fact it is as I'm giving you some insights into the question. Not all engines are POSIX-compliant. Locales might have impact on character ranges in regular expressions. I found some questions around this topic, likely to be dupes. – revo Jul 19 '18 at 20:51
  • Please see both questions [here](https://stackoverflow.com/questions/11925537/should-we-consider-using-range-a-z-as-a-bug) and [here](https://stackoverflow.com/questions/50302910/regex-a-z-do-not-recognize-local-characters). – revo Jul 19 '18 at 20:54
  • I think the answer to this question is that there is neither an exact specification of the `re` module's range semantics, nor is there a promise not to include such a specification in the future. But it's hard to find any references establishing that something does not exist. – kaya3 Dec 06 '19 at 16:51
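
Two of the narrower points raised in the comments can be checked directly: that `re.UNICODE` is already in effect for `str` patterns in Python 3, and that a negated class matches a character outside the BMP as a single character. The following is a minimal sketch; the variable names are illustrative only, and the results describe current CPython behavior rather than a documented guarantee.

```python
import re

# The same pattern compiled with and without the explicit flag.
implicit = re.compile(r"[^\x00-\x7F]")
explicit = re.compile(r"[^\x00-\x7F]", re.UNICODE)

# For str patterns, Python 3 sets re.UNICODE unless re.ASCII is requested.
unicode_is_default = bool(implicit.flags & re.UNICODE)
same_flags = implicit.flags == explicit.flags

# U+1F600 lies outside the BMP; in Python 3.3+ it is one code point in a str.
astral = "\U0001F600"
matches_astral = implicit.fullmatch(astral) is not None

print(unicode_is_default, same_flags, len(astral), matches_astral)
```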

0 Answers