
How can I create a regular expression to match distinct words?

I tried the following regex, but it also matches words embedded in other words:

@"(abs|acos|acosh|asin|asinh|atan|atanh)"

For example, with

@"xxxabs abs"

The standalone abs should match, but the abs inside xxxabs should not.

Dmitry

1 Answer


Although the solution (word boundaries) is an old classic, yours is an interesting question because the words in the alternation are so similar.

You can start with this:

\b(?:abs|acos|acosh|asin|asinh|atan|atanh)\b

And then compress it to this:

\b(?:a(?:cosh?|sinh?|tanh?|bs))\b

How does it work?

  1. The key idea is to wrap the alternation in word boundaries (\b) so that a match cannot be embedded in a larger word.
  2. The compression factors out the shared leading a and the optional trailing h, so a backtracking engine can try fewer alternatives. It's hard to read, though, so unless you need every last drop of performance, treat it as purely for entertainment.
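To see the effect of the word boundaries, here is a minimal sketch in Python. Python's re module is used only for illustration (the question is about Objective-C), and the variable names are made up for this example. It runs the original pattern and the bounded pattern over the sample string from the question.

```python
import re

# The seven function names from the question.
WORDS = "abs|acos|acosh|asin|asinh|atan|atanh"

# The pattern from the question: no boundaries, so it matches inside other words.
unbounded = re.compile(r"(" + WORDS + r")")
# The same alternation wrapped in word boundaries.
bounded = re.compile(r"\b(?:" + WORDS + r")\b")

text = "xxxabs abs"

# Collect the (start, end) offsets of every match.
unbounded_spans = [m.span() for m in unbounded.finditer(text)]
bounded_spans = [m.span() for m in bounded.finditer(text)]

print(unbounded_spans)  # [(3, 6), (7, 10)]
print(bounded_spans)    # [(7, 10)]
```

The unbounded pattern also finds the abs inside xxxabs at offset 3. With \b on both sides, only the standalone abs at offset 7 remains.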

Token-By-Token

\b                       # the boundary between a word char (\w) and
                         # something that is not a word char
(?:                      # group, but do not capture:
  a                      #   'a'
  (?:                    #   group, but do not capture:
    cos                  #     'cos'
    h?                   #     'h' (optional; greedy, so it is
                         #     matched when present)
   |                     #    OR
    sin                  #     'sin'
    h?                   #     'h' (optional; greedy, so it is
                         #     matched when present)
   |                     #    OR
    tan                  #     'tan'
    h?                   #     'h' (optional; greedy, so it is
                         #     matched when present)
   |                     #    OR
    bs                   #     'bs'
  )                      #   end of grouping
)                        # end of grouping
\b                       # the boundary between a word char (\w) and
                         # something that is not a word char

Bonus Regex

In case you're feeling depressed today, this alternate compression (is it longer than the original?) should cheer you up.

\b(?:a(?:(?:co|b)s|(?:cos|(?:si|ta)n)h|(?:si|ta)n))\b
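If you want to convince yourself that the three variants accept the same words, a quick check along these lines may help. It is a sketch in Python, used only for illustration; the helper accepted and the sample lists are invented for this example.

```python
import re

ORIGINAL = r"\b(?:abs|acos|acosh|asin|asinh|atan|atanh)\b"
COMPRESSED = r"\b(?:a(?:cosh?|sinh?|tanh?|bs))\b"
BONUS = r"\b(?:a(?:(?:co|b)s|(?:cos|(?:si|ta)n)h|(?:si|ta)n))\b"

WORDS = ["abs", "acos", "acosh", "asin", "asinh", "atan", "atanh"]
NON_WORDS = ["xxxabs", "absx", "acoshh", "sin", "a", "acs"]


def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


# Every variant should accept all seven words and reject the near misses.
for pattern in (ORIGINAL, COMPRESSED, BONUS):
    assert accepted(pattern, WORDS) == WORDS
    assert accepted(pattern, NON_WORDS) == []

# All variants should also find the same matches in running text.
sample = "acosh(x) + xxxabs - abs(y) atanh2"
results = [re.findall(p, sample) for p in (ORIGINAL, COMPRESSED, BONUS)]
print(results[0])  # ['acosh', 'abs']
```

Note that atanh2 is rejected because a digit is a word character, so there is no boundary after the h.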
zx81
  • zx81, thank you for the answer, but it doesn't work for me. Does it work in Objective-C with the iOS SDK? – Dmitry Jun 08 '14 at 10:58
  • The correct string is `@"\\b(?:abs|acos|acosh|asin|asinh|atan|atanh)\\b"`. – Dmitry Jun 08 '14 at 12:23
  • Ah, great, you found the problem, thanks for letting me know. Yes, I gave you the regex, not the escaped string. It didn't occur to me that this was the problem. :) – zx81 Jun 08 '14 at 18:54
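The escaping problem from the comments can be reproduced in other languages too. As a rough illustration, here it is in Python rather than Objective-C: an ordinary Python string literal, like an @"..." literal, turns \b into a backspace character before the regex engine ever sees it. The shortened alternation and the variable names are assumptions made for this sketch.

```python
import re

text = "xxxabs abs"

# In an ordinary string literal, "\b" is a backspace character (U+0008),
# so the regex engine looks for literal backspaces, not word boundaries.
unescaped = "\b(?:abs|acos)\b"

# Doubling the backslash passes a real \b through to the regex engine.
escaped = "\\b(?:abs|acos)\\b"

# A Python raw string is an alternative way to write the same pattern.
raw = r"\b(?:abs|acos)\b"

unescaped_matches = re.findall(unescaped, text)
escaped_matches = re.findall(escaped, text)

print(unescaped_matches)  # []
print(escaped_matches)    # ['abs']
print(escaped == raw)     # True
```

The unescaped version finds nothing because the text contains no backspace characters, which is consistent with the "doesn't work for me" symptom in the first comment.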