
I am using Python's re module to capture all forms of the word color in American English (AmE) and British English (BrE). I have captured almost all of them, with the exception of words that end with an apostrophe, e.g. colors’. This problem is from Watt's Beginning Regular Expressions book.

Here's sample text:

Red is a color.
His collar is too tight or too colouuuurful.
These are bright colours.
These are bright colors.
Calorific is a scientific term.
“Your life is very colorful,” she said.
color (U.S. English, singular noun)
colour (British English, singular noun)
colors (U.S. English, plural noun)
colours (British English, plural noun)
color’s (U.S. English, possessive singular)
colour’s (British English, possessive singular)
colors’ (U.S. English, possessive plural)
colours’ (British English, possessive plural)

Here's my regex: \bcolou?r(?:[a-zA-Z’s]+)?\b

Explanation:

\b                 # Start at a word boundary
colou?r            # The u is optional, so both AmE and BrE spellings match
    (?:            # Start non-capturing group
    [a-zA-Z’s]+    # color may be followed by a modifier (e.g. ful) or an apostrophe
    )?             # End non-capturing group; these characters are optional
\b                 # End at a word boundary

The issue is that colors’ and colours’ are matched only up to the s; the trailing apostrophe is left out of the match. Can someone please explain what is wrong with my regex? I looked at the SO question "Regex Apostrophe how to match?", but the problems there are about escaping ' and ".
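The behaviour described above can be reproduced with a short script. This is a minimal sketch that uses the pattern from the question unchanged on three of the sample lines; the names `pattern`, `samples`, and `matches` are chosen here for illustration only.

```python
import re

# The pattern from the question, trailing \b included.
pattern = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?\b")

samples = [
    "color’s (U.S. English, possessive singular)",
    "colors’ (U.S. English, possessive plural)",
    "colours’ (British English, possessive plural)",
]

# Take the first match on each line.
matches = [pattern.search(line).group() for line in samples]
for line, match in zip(samples, matches):
    print(f"{match!r:12} <- {line}")
```

The singular possessive is matched in full because it ends in s, a word character, whereas the plural possessives lose their final ’.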

Here's Regex101

Thanks in advance.

watchtower

2 Answers


The problem is that \b is a word boundary, and with ...lors’, the position between the ’ and the following space is not a word boundary, because neither the ’ nor the space is a word character. Instead of \b, use a lookahead for a space, a period, a comma, or whatever else may come afterwards:

\bcolou?r(?:[a-zA-Z’s]+)?(?=[ .,])

https://regex101.com/r/lB49Nr/3
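As a rough check, the lookahead version can be run in Python against a few lines of the sample text. The variable names below (`lookahead`, `text`, `found`, `at_end`) are illustrative and not part of the answer.

```python
import re

# Pattern from this answer: the trailing \b is replaced by a lookahead.
lookahead = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?(?=[ .,])")

text = (
    "These are bright colours. "
    "“Your life is very colorful,” she said. "
    "colors’ (U.S. English, possessive plural) "
    "colour’s (British English, possessive singular)"
)
found = lookahead.findall(text)
print(found)

# Limitation: the lookahead needs one of the listed characters to follow,
# so a word at the very end of the input is not matched.
at_end = lookahead.findall("bright colors")
print(at_end)
```

Because the lookahead only accepts the characters listed in its class, that class would need extending (or an end-of-input alternative added) for text where the word is followed by something else.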

CertainPerformance
  • Many thanks. I am curious why you added `?=[ .,]` when the regex `\bcolou?r(?:[a-zA-Z’s]+)?` works without `?=`. Thanks for your help. – watchtower Oct 07 '18 at 06:16
  • Sure, you could do that too, but then you'll match, for example, `color` in `color256` or `color` in `color_set`, and so on for other characters not in the `[a-zA-Z’s]` character set. Maybe that's an issue for you, maybe it isn't; I was just trying to be faithful to the original intent of your `\b`. – CertainPerformance Oct 07 '18 at 06:21
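The trade-off discussed in these comments can be seen by running both patterns on the same input. This is only a sketch; the sentence in `text` is made up for illustration and is not from the question.

```python
import re

# Without any trailing assertion, and with the lookahead from the answer.
no_boundary = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?")
lookahead = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?(?=[ .,])")

# Hypothetical input mixing identifiers with a real possessive.
text = "color256 and color_set are identifiers, not colors’ modifiers."

loose = no_boundary.findall(text)   # also picks up the identifier prefixes
strict = lookahead.findall(text)    # only the word followed by a space

print(loose)
print(strict)
```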

The problem is the trailing \b. By definition:

\b Matches, without consuming any characters, immediately between a character matched by \w and a character not matched by \w (in either order). It cannot be used to separate non words from words.

’ is not in the \w group. Try removing the trailing \b: \bcolou?r(?:[a-zA-Z’s]+)?
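A quick way to check this suggestion in Python is sketched below. The second pattern is not from either answer: it is a possible alternative, assumed here for illustration, that swaps the trailing \b for a negative lookahead `(?!\w)` ("not followed by a word character"), which keeps the apostrophe and also works at the end of the input.

```python
import re

text = "colours’ (British English, possessive plural)"

# Suggestion from this answer: drop the trailing \b.
no_boundary = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?")
plain = no_boundary.findall(text)

# Alternative (an assumption, not from the answers): require that
# no word character follows the match.
not_word_next = re.compile(r"\bcolou?r(?:[a-zA-Z’s]+)?(?!\w)")
guarded = not_word_next.findall(text)
at_end = not_word_next.findall("colours’")      # apostrophe at end of input
identifier = not_word_next.findall("color256")  # a digit follows, so no match

print(plain, guarded, at_end, identifier)
```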

digitake