4

In certain flavours of regular expression, inside a square bracket expression, the = symbol is a special character that is used as a delimiter for enclosing any one the elements within an equivalence class. The documentation says the following:

An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in Collation Order. Only primary equivalence classes shall be recognized. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( "[=" and "=]" ) delimiters. For example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]". If the collating element does not belong to an equivalence class, the equivalence class expression shall be treated as a collating symbol.

I am not quite sure what this means. If a, à and â belong to the same equivalence class, does that mean we wish to specify that the regular expressions "[ab]", "[àb]" and "[âb]" are equivalent? Then what is the purpose of using the [= =] delimiter, since we might as well write "[aàâb]"?

I understand what an "equivalence class" means in its general definition, but I am unable to grasp its meaning in this context.

Enlico
  • 23,259
  • 6
  • 48
  • 102

1 Answers1

2

Essentially, for example, [=a=] means "all the characters that belong to the equivalence class which a belongs to. If a and à form an equivalence class which only contains those two, then [=a=] and [=à=] are both the same as (obviously written inside a […]). But if the equivalence class also contains â, then all of [=a=], [=à=], and [=â=] mean aàâ (again, written in […]).

What characters an equivalence class contains depends on the locale, but if the locale doesn't define a given equivalence class, let's say [=a=], then the collating sequence with the same name will be used, i.e. [.a.], which normally is just the same as a, as the locales usually include normal characters as collating sequences.

Reference: Mastering Regular Expressions, 3rd Edition, page 128, which is an excellent book on regular expression, written by someone who knows regex down to the smallest bit.

Enlico
  • 23,259
  • 6
  • 48
  • 102
  • How are these equivalence classes defined? How do we know which are the characters that belong to the equivalence class of `a` when we write `[=a=]` in our regular expression? –  Jun 01 '21 at 13:05
  • 1
    You don't know, or rather you don't want to have to find out. You might have something like `"[=$c=]"` where `$c` is user input and you want to match anything that looks like the same char. Equivalence classes are defined in the Unicode data set and you'd need a lot of code to do that lookup yourself. – LHMathies Jun 01 '21 at 13:14
  • @LHMathies, is that an answer to the OP comment? – Enlico Jun 01 '21 at 13:15
  • @ghulam_e_khuda, any feedback? – Enlico Jun 04 '21 at 07:15
  • 1
    @Enlico, yes. You answered how the system finds out, I addressed why there are cases where the programmer just wants the system to do that without knowing Unicode by heart. – LHMathies Jun 11 '21 at 08:52