
I would like to use regular expressions on UTF-32 codepoints and found this reference stating that std::regex_traits has to be defined by the user before std::basic_regex can be used at all. There seem to be no changes planned for this in the future.

  1. Why is this even the case?

  2. Does this have to do with the fact that Unicode says combined codepoints have to be treated as equal to the single-codepoint representation (like the umlaut 'ä' represented either as a single codepoint or as the 'a' and the dots as two separate ones)?

  3. Given the simplification that only single-codepoint characters would be supported, could this trait be defined easily, or would it nevertheless be non-trivial or require further limitations?

Ident

2 Answers

  1. Some aspects of regex matching are locale-aware, with the result that a std::regex_traits object includes or references an instance of a std::locale object. The C++ standard library only provides locales for char and wchar_t characters, so there is no standard locale for char32_t (unless it happens to be the same as wchar_t), and this restriction carries over into regexes.

  2. Your description is imprecise. Unicode defines a canonical equivalence relationship between two strings, which is based on normalizing both strings, using either NFC or NFD, and then comparing the normalized values codepoint by codepoint. It does not define canonical equivalence simply as an equivalence between a codepoint and a codepoint sequence, because normalization cannot be done character by character: it may require reordering combining characters into the canonical order (after canonical (de)composition). As such, it does not fit easily into the C++ model of locale transformations, which are generally single-character.

    The C++ standard library does not implement any Unicode normalization algorithm; in C++, as in many other languages, the two strings L"\u00e4" (ä) and L"\u0061\u0308" (ä) will compare as different, although they are canonically equivalent and look to the human reader like the same grapheme. (On the machine I'm writing this answer on, the rendering of those two graphemes is subtly different; if you look closely, you'll see that the umlaut in the second one is slightly displaced from its visually optimal position. That violates the Unicode requirement that canonically equivalent strings have precisely the same rendering.)
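    This is easy to verify: a minimal sketch comparing the precomposed and decomposed spellings of 'ä' with an ordinary codepoint-by-codepoint string comparison (the helper name `same_codepoints` is just for illustration):

    ```cpp
    #include <cassert>
    #include <string>

    // Plain codepoint-by-codepoint comparison; no normalization happens.
    bool same_codepoints(const std::wstring& a, const std::wstring& b) {
        return a == b;
    }

    int main() {
        std::wstring nfc = L"\u00e4";        // 'ä' as a single codepoint (NFC)
        std::wstring nfd = L"\u0061\u0308";  // 'a' + combining diaeresis (NFD)
        // Unequal, despite being canonically equivalent in Unicode terms.
        assert(!same_codepoints(nfc, nfd));
    }
    ```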

    If you want to check for canonical equivalence of two strings, you need to use a Unicode normalisation library. Unfortunately, the C++ standard library does not include any such API; you could look at ICU (which also includes Unicode-aware regex matching).

    In any case, regular expression matching -- to the extent that it is specified in the C++ standard -- does not normalize the target string. This is permitted by the Unicode Technical Report on regular expressions, which recommends that the target string be explicitly normalized to some normalization form and the pattern written to work with strings normalized to that form:

    For most full-featured regular expression engines, it is quite difficult to match under canonical equivalence, which may involve reordering, splitting, or merging of characters.… In practice, regex APIs are not set up to match parts of characters or handle discontiguous selections. There are many other edge cases… It is feasible, however, to construct patterns that will match against NFD (or NFKD) text. That can be done by:

    • Putting the text to be matched into a defined normalization form (NFD or NFKD).
    • Having the user design the regular expression pattern to match against that defined normalization form. For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.
    • Applying the matching algorithm on a code point by code point basis, as usual.
  3. The bulk of the work in creating a char32_t specialization of std::regex_traits would be creating a char32_t locale object. I've never tried doing either of these things; I suspect it would require a fair amount of attention to detail, because there are a lot of odd corner cases.


The C++ standard is somewhat vague about the details of regular expression matching, leaving the details to external documentation about each flavour of regular expression (and without a full explanation of how to apply such external specifications to character types other than the one each flavour is specified for). However, it is possible to deduce that matching is character-by-character. For example, in § 28.3, Requirements [re.req], Table 136 includes the locale method responsible for the character-by-character equivalence algorithm:

Expression: v.translate(c) Return type: X::char_type Assertion: Returns a character such that for any character d that is to be considered equivalent to c then v.translate(c) == v.translate(d).

Similarly, in the description of regular expression matching for the default "Modified ECMAScript" flavour (§ 28.13), the standard describes how the regular expression engine matches two characters (one in the pattern and one in the target) (paragraph 14.1):

During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using the following rules:

  1. if (flags() & regex_constants::icase) the two characters are equal if traits_inst.translate_nocase(c) == traits_inst.translate_nocase(d);

  2. otherwise, if flags() & regex_constants::collate the two characters are equal if traits_inst.translate(c) == traits_inst.translate(d);

  3. otherwise, the two characters are equal if c == d.

rici
  • Regarding #2: According to the answers to the following question, Unicode equivalence can in fact be affected if specified by the locale. The answers also list other regex behaviours that are affected by the locale and are therefore relevant in regard to #3: http://stackoverflow.com/questions/9043712/locale-specific-behavior-in-the-regex-library – Ident Nov 14 '15 at 15:33
  • @Ident: Not all of the answer you link is correct. The locale affects the matching of (single) characters with special character classes (`[:class:]`, `[.class.]` and `[=class=]`), and character ranges (`[a-z]`). It also affects matching single characters with each other if the `icase` or `collate` flags are set. But (at least in the standard regex dialects), the comparison is always with a single `CharT`; multicharacter codes in the target string are not reinterpreted as single codepoints... – rici Nov 14 '15 at 17:57
  • Ok thanks for the answers, one last question: let's assume I did not want to support proper ordering and just wanted to sort non-lexicographically by the numerical representation of the character, and generally wanted everything to work like the "C" locale does when it comes to character classes etc. Would this be easily doable, or would those corner cases you mentioned in the initial answer come into play anyway? Could I mostly wrap the char32_t regex_traits around the char regex_traits for this purpose? – Ident Nov 14 '15 at 18:47
  • ... Moved most of the commentary into the answer. – rici Nov 14 '15 at 19:23
  • @Ident: If your `wchar_t` is 32 bits, then it should all be pretty simple. If `wchar_t` is only 16 bits, then its locale might not correctly handle character types of unicode characters outside of the BMP. You'd have to try it. If you installed ICU, you could defer pretty much all the work to ICU functions, but then you could also use the ICU regular expression API instead of the C++ standard library. – rici Nov 14 '15 at 19:26
  • I am working on an open-source cross-platform library; as developers we therefore can't assume anything platform-specific :) Yes, this is one of the moments where this becomes a big pain. We also cannot install ICU, that's by far too large as a dependency. We would like a solution that uses no dependencies (other than C++11) if possible. – Ident Nov 14 '15 at 21:34
  • @Ident: If you cannot assume anything, you cannot assume that a unicode locale is available, because c++11 does not require it to be available. Just about the only guarantee you have is that if `__STDC_ISO_10646__` is defined, then (certain) Unicode codepoints will fit in a `wchar_t`. The other guarantee is that you can store UTF8-encoded strings in a `std::basic_string`, but I gather that for some reason you don't want to use UTF-8... – rici Nov 14 '15 at 21:52
  • ... along with UTF-8, you have the standard `codecvt` conversions to convert between UTF-16, UTF-32 and UTF-8; those are not locale-dependent. So the basic low-level facilities are there, I guess. Maybe it is not very satisfactory. – rici Nov 14 '15 at 21:55
  • We decided we don't want UTF-8 as the internal representation due to the problems with splitting/counting/editing UTF-8 strings, but even UTF-8 regex support is not provided: http://stackoverflow.com/a/15895746/3144964 We use codecvt for conversion; I am not sure I understand how this is related to regex / locale, can you clarify? Are you saying we should convert the string to the specific wstring type with that and then use regex on it? This does not sound very efficient but generally like a solution that would work. – Ident Nov 14 '15 at 21:57
  • @Ident: it's not related to regex or locale; it means that you can reliably convert between UTF-8 and UTF-32, regardless of platform-dependent locale support. Convenient UTF-8 regex support is not available, but you can use an external preprocessor with a Unicode database to create a byte-oriented regex from a regex with explicitly marked unicode codepoints. If you find that too much work, I again suggest you look at ICU; you only need to include what you need to include. – rici Nov 14 '15 at 22:01
  • @ident: Final comment: I've often seen the complaint about the difficulty of splitting/counting/editing UTF-8 strings, but I find it unconvincing, in part because you *always* need to think about multi-codepoint graphemes. (Not all scripts have composed graphemes.) So the difficulty is inherent in the data model. Handling the UTF-8 multibyte encoding is almost trivial in comparison, and you get most of what you need by using (multibyte) iterators instead of trying to array-index character strings (which is bad style even for single-byte character strings). All that only IMHO. Good luck. – rici Nov 14 '15 at 22:12
  • Thanks for the input. I was initially going for UTF-8 but was kind of convinced not to; after all, UTF-8 would cause a lot of work for us, but now it seems UTF-32 does as well if we want to do this properly. We are still exploring options, so this question is affecting the outcome. But like you said: even with UTF-8 we would have the same issue regarding regex support that you mentioned: it does not work out of the box. The problem with wchar is that it is not directly Unicode either and is handled differently on Windows and Unix, so I would want to avoid this maintenance hell. – Ident Nov 14 '15 at 22:21

I've just discovered a regex implementation which supports char32_t: http://www.akenotsuki.com/misc/srell/en/

It mimics std::regex API and is under BSD license.

igagis