2

I am learning regex. One of the problems requires me to find all words that begin with a vowel. I am using Python's re module to evaluate the regular expression.

Here is the regex I made:

\<[aeiouAEIOU].*?\>

The above regex does not work with the \< and \> anchors, but it works with the \b anchor. Why?

tripleee
  • I think different engines/flavors support different syntax for the same thing but I am not sure. – Vishal Singh Mar 22 '21 at 05:40
  • Well, what is the text against which you are trying to match, and can you include that in your question? – Tim Biegeleisen Mar 22 '21 at 05:42
  • @VishalSingh "GNU" is not well-defined in this context. Some GNU utilities use `\<` ... `\>`, others use `\b`, others have no support at all for regex word boundaries. – tripleee Mar 22 '21 at 06:23

2 Answers

2

"Does not work" is not correct; one works in some regex dialects, the other in others.

Most "modern" regex dialects (Python, Perl, Ruby, etc) use \b as the word boundary, on both sides.

More traditional regex dialects, like the original egrep, use \< as the left word boundary operator, and \> on the right.

(Strictly speaking, Al Aho's original egrep did not have word boundaries; this feature was added later. Maybe see https://stackoverflow.com/a/39367415/874188 for a one-minute summary of regex history.)
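For the Python case in the question, the \b form can be sketched as follows. The sample sentence and variable names are made up for illustration, and the second pattern is only one possible alternative, not something the question asked for:

```python
import re

# Hypothetical sample text, chosen only to illustrate the patterns.
text = "An apple is under the old oak tree"

# The pattern from the question, with \b on both sides instead of \< and \>.
original_style = re.findall(r"\b[aeiouAEIOU].*?\b", text)

# An alternative that consumes only word characters after the leading vowel.
word_chars_only = re.findall(r"\b[aeiouAEIOU]\w*", text)

print(original_style)
print(word_chars_only)
```

On this sample both patterns find the same words. The \w* variant may be the safer choice on real text, since .*? can also step over non-word characters such as apostrophes.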

tripleee
  • Just tried the regex in Sublime Text search and it worked. I am now curious why the Python code ran at all; it should have raised a warning. Anyway, I learned something new: regex has different implementations. – InputBlackBoxOutput Mar 23 '21 at 06:18
1

Python re does not support the "leading/starting word boundary" construct \< (in other regex flavors, also \m or [[:<:]]), nor the "closing/trailing word boundary" construct \> (in other regex flavors, also \M or [[:>:]]).
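This also explains why the pattern compiles in Python without any error or warning: a backslash followed by a non-alphanumeric ASCII character simply matches that character, so \< and \> are read as a literal < and >. A small sketch with made-up sample strings:

```python
import re

# Compiles fine: \< and \> are treated as the literal characters < and >.
pattern = re.compile(r"\<[aeiouAEIOU].*?\>")

# No literal angle brackets in the text, so nothing matches.
plain = pattern.findall("An apple is old")

# With literal angle brackets around vowel-initial words, the pattern matches.
bracketed = pattern.findall("an <apple> and <oak>")

print(plain)      # []
print(bracketed)  # ['<apple>', '<oak>']
```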

Note that separate leading and trailing word boundaries are not supported by most NFA regex engines, which are often referred to as "modern" engines. The usual way is to use \b, as you have already noticed, because it is more convenient.

However, this convenience comes at a price: \b is a context-dependent pattern. This problem has been covered very broadly on SO; my answer to "Word boundary with words starting or ending with special characters gives unexpected results" covers some aspects of \b.
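A minimal illustration of that context dependence, using a hypothetical pattern and sample strings: the same \b in front of a non-word character matches or fails depending on what precedes it.

```python
import re

# \b before "+" needs a word character on the left, because "+" is a non-word char.
pattern = r"\b\+1\b"

# Preceded by a space (non-word): no boundary between " " and "+", so no match.
after_space = re.search(pattern, "call +1 now")

# Preceded by a letter (word char): there is a boundary between "a" and "+".
after_letter = re.search(pattern, "a+1 b")

print(after_space)           # None
print(after_letter.group())  # +1
```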

So, if you plan to use \< or \>, you need to implement them manually like this:

  • \< = a position at a word boundary where the char to the right is a word char, i.e. \b(?=\w).
  • \> = a position at a word boundary where the char to the left is a word char, i.e. \b(?<=\w).
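The two emulations above can be tried in Python as a rough sketch. The names START_OF_WORD and END_OF_WORD and the sample strings are placeholders introduced here, not part of the re module:

```python
import re

# Hypothetical helper names for the emulated \< and \> constructs.
START_OF_WORD = r"\b(?=\w)"   # boundary with a word char to the right
END_OF_WORD = r"\b(?<=\w)"    # boundary with a word char to the left

# Positions of each kind of boundary in a short sample string.
starts = [m.start() for m in re.finditer(START_OF_WORD, "hi there")]
ends = [m.start() for m in re.finditer(END_OF_WORD, "hi there")]

# The pattern from the question, rewritten with the emulated anchors.
vowel_words = re.findall(
    START_OF_WORD + r"[aeiouAEIOU].*?" + END_OF_WORD,
    "An apple is under the old oak tree",
)

print(starts)       # [0, 3]
print(ends)         # [2, 8]
print(vowel_words)  # ['An', 'apple', 'is', 'under', 'old', 'oak']
```

Unlike a bare \b, each emulated anchor matches only one side of a word, which is what \< and \> do in the flavors that support them.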

This is how these word boundary variants are handled in the PCRE library:

COMPATIBILITY FEATURE FOR WORD BOUNDARIES

In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of word". PCRE treats these items as follows:

[[:<:]] is converted to \b(?=\w)
[[:>:]] is converted to \b(?<=\w)

Wiktor Stribiżew