I have the following regular expression, which I am compiling with the Pattern class.

\bIntegrated\s+Health\s+System\s+\(IHS\)\b

Why is this not matching this string?

"test pattern case Integrated Health System (IHS)."

If I try \bpattern\b, it works, but the phrase above does not match. I have escaped the parentheses in the pattern, so I'm not sure why it doesn't work. It does match if I remove the parenthesized portion of the pattern, but I want to match the whole thing.

Eqbal
  • I did escape it, stackoverflow un-escaped :). My expression reads like this: \bIntegrated\s+Health\s+System\s+\\(IHS\\)\b – Eqbal Jan 05 '10 at 23:30
  • You should edit your question rather than adding a comment. – Jherico Jan 05 '10 at 23:31
  • SO doesn't know tags. Just indent with 4 spaces or select it and press `010101` button or `Ctrl+K`. Also see the Markdown FAQ on the right hand of the message editor. – BalusC Jan 05 '10 at 23:33
  • Got it (indenting 4 spaces for code)! Thanks! – Eqbal Jan 05 '10 at 23:37

3 Answers

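1) Escape the parentheses; otherwise they are grouping metacharacters that create a capturing group, not literal parentheses: \( \)

2) Remove the final \b. You can't rely on a word boundary after a literal ), since ) is not considered part of a word: \b only matches there if a word character follows, and in your string the next character is a period.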

\bIntegrated\s+Health\s+System\s+\(IHS\)\W
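
To make the difference concrete, here is a minimal sketch that runs both endings against the string from the question. The class name `WordBoundaryDemo` and the helper `found` are made up for illustration. Note the doubled backslashes required in Java string literals.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "test pattern case Integrated Health System (IHS).";
        String base = "\\bIntegrated\\s+Health\\s+System\\s+\\(IHS\\)";

        // ")" and "." are both non-word characters, so no word boundary lies between them.
        System.out.println(found(base + "\\b", text)); // false

        // \W consumes the "." that follows the closing parenthesis.
        System.out.println(found(base + "\\W", text)); // true

        // \W needs a character to consume, so it fails when the phrase ends the input.
        System.out.println(found(base + "\\W", "Integrated Health System (IHS)")); // false
    }
}
```

The last case is the reason the comments below end up with `($|\W)` instead of a bare `\W`.
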
Paul Creasey
  • Okay, how do I indicate the trailing boundary then, so that it does not match something like \bIntegrated\s+Health\s+System\s+\\(IHS\\)testing? I need to make sure it only matches the whole phrase and not some string that starts with this phrase. – Eqbal Jan 05 '10 at 23:35
  • You could use \W, which is the same as [^\w] or [^a-zA-Z0-9_] (I'm not sure exactly what it includes in Java), or you could create your own character class (or negated class) to specify what does or does not indicate a match. I've updated the example with \W, which will likely work pretty well. – Paul Creasey Jan 05 '10 at 23:48
  • Thanks, \W seems to work pretty well so far combined with grouping to extract the matched phrase minus the non-word character that follows. – Eqbal Jan 06 '10 at 00:00
  • If you want to allow the match at the end of the string you would have to say `($|\W)`. I'm not sure it's so important though, are you likely to have strings like `Integrated Health Systems (IHS)foo`? The close bracket is almost invariably followed by space or punctuation. – bobince Jan 06 '10 at 00:16
  • Okay, here is my final regex pattern: `"(\\b|\\W)(" + phrase + ")($|\\W)"` Using the group 2 to get the matched phrase. – Eqbal Jan 06 '10 at 01:17
  • Hmm. That causes problem if the phrase begins with a "(". So modified it to `"(^|\\W)(" + phrase + ")($|\\W)"` – Eqbal Jan 06 '10 at 01:27
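
For anyone landing here later, the final pattern from the comments above can be sketched roughly like this. The class `PhraseMatcher` and the method `findPhrase` are hypothetical names, and the phrase is assumed to be a regex fragment that is already escaped where needed, as in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseMatcher {
    // Hypothetical helper: returns the matched phrase (group 2), or null if not found.
    // 'phrase' is inserted as a regex fragment, not as literal text.
    static String findPhrase(String phrase, String input) {
        Pattern p = Pattern.compile("(^|\\W)(" + phrase + ")($|\\W)");
        Matcher m = p.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String phrase = "Integrated\\s+Health\\s+System\\s+\\(IHS\\)";

        // Followed by punctuation: group 2 holds the phrase without the trailing ".".
        System.out.println(findPhrase(phrase, "test pattern case Integrated Health System (IHS)."));

        // Phrase is the whole input: ^ and $ take the place of the surrounding \W.
        System.out.println(findPhrase(phrase, "Integrated Health System (IHS)"));

        // Followed directly by a word character: no match, so this prints null.
        System.out.println(findPhrase(phrase, "Integrated Health System (IHS)testing"));
    }
}
```

Group 2 is used because groups 1 and 3 also capture the neighbouring non-word character, which is not part of the phrase.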

You've got (IHS) - a group - where you want \(IHS\) as the literal brackets.

cyborg

You need to escape the parentheses

\bIntegrated\s+Health\s+System\s+\(IHS\)\b

Parentheses delimit a capture group. To match a literal set of parentheses, you can escape them like this \( \)
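
A small sketch of the group-versus-literal difference, in case it helps. The class name `ParenEscapeDemo` and the helper `matchesWhole` are made up for this example; `Pattern.matches` requires the whole input to match.

```java
import java.util.regex.Pattern;

class ParenEscapeDemo {
    // Hypothetical helper: true only if the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped: the parentheses form a group, so only the letters are matched.
        System.out.println(matchesWhole("(IHS)", "IHS"));        // true
        System.out.println(matchesWhole("(IHS)", "(IHS)"));      // false

        // Escaped: the parentheses are literal characters in the input.
        System.out.println(matchesWhole("\\(IHS\\)", "(IHS)"));  // true
    }
}
```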

mopoke
  • It isn’t safe to use `\b` in Java. It doesn’t mean what you think it does. [See here](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261) for why. – tchrist Dec 02 '10 at 03:21