2

In Java, I am trying to determine whether a user-supplied string (meaning I do not know in advance what the input will be) is contained exactly within another string, on word boundaries. So an input of `the` should not be matched in the text `there is no match`. However, I am running into issues when there is punctuation in the input string and could use some help.

With no punctuation, this works just fine:

String input = "string contain";
Pattern p = Pattern.compile("\\b" + Pattern.quote(input) + "\\b");

//both should and do match
System.out.println(p.matcher("does this string contain the input").find());
System.out.println(p.matcher("does this string contain? the input").find());

However when the input has a question mark in it, the matching with the word boundary doesn't seem to work:

String input = "string contain?";
Pattern p = Pattern.compile("\\b" + Pattern.quote(input) + "\\b");

//should not match - doesn't
System.out.println(p.matcher("does this string contain the input").find());

//expected match - doesn't
System.out.println(p.matcher("does this string contain? the input").find());

//should not match - but does
System.out.println(p.matcher("does this string contain?fail the input").find());

Any help would be appreciated.

Mr Zorn

3 Answers

2

There's no word boundary between `?` and the space that follows it, because neither of them is a word character; that's why your pattern doesn't match. You can change it to this:

Pattern.compile("(^|\\W)" + Pattern.quote(input) + "($|\\W)");

That matches the beginning of the input or a non-word character, then your pattern, then the end of the input or a non-word character. Or, better, use a negative lookbehind and a negative lookahead:

Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(input) + "(?!\\w)");

This means that there must not be a word character directly before or directly after your pattern.
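As a rough sketch, the lookaround version can be tried against the strings from the question. The class name `LookaroundMatchDemo` and the helper `containsPhrase` are made up for this illustration; only `Pattern.compile`, `Pattern.quote` and `Matcher.find` come from the JDK.

```java
import java.util.regex.Pattern;

class LookaroundMatchDemo {
    // Hypothetical helper: true if input occurs in text with no word
    // character (\w) directly before it and none directly after it.
    static boolean containsPhrase(String text, String input) {
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(input) + "(?!\\w)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String input = "string contain?";
        // No '?' in the text, so the quoted input cannot match.
        System.out.println(containsPhrase("does this string contain the input", input));      // false
        // Space before, space after: no adjacent word characters.
        System.out.println(containsPhrase("does this string contain? the input", input));     // true
        // 'f' follows the '?', so the negative lookahead rejects it.
        System.out.println(containsPhrase("does this string contain?fail the input", input)); // false
        // End of text after the '?': there is no following character at all.
        System.out.println(containsPhrase("does the string contain?", input));                // true
    }
}
```

Because the lookarounds are zero-width, the match itself covers only the quoted input and not the neighbouring characters.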

steffen
  • It's because of the `!` in between; this is your word boundary. With your input sequence, "does this string contain?!fail the input", `string contain?!` does not match, but `string contain?` does. That's consistent. – steffen May 31 '17 at 19:56
  • sorry - deleted my comment before I saw your response as I realized my mistake. one more though, if the text ends with the input it doesn't seem to be matching, so in this example `does the string contain?` will fail but `does the string contain? ` (with a space) is good. One of these days I will learn my regex better! – Mr Zorn May 31 '17 at 20:38
  • @MrZorn Both do match. Read the last line: It matches, if there's not a following word character. This is not the case in both of your examples, so it works for both strings. – steffen May 31 '17 at 20:41
  • Ah - thought I grabbed the second pattern - that is working – Mr Zorn May 31 '17 at 20:45
1

You can use:

Pattern p = Pattern.compile("(\\s|^)" + Pattern.quote(input) + "(\\s|$)");
//---------------------------^^^^^^^----------------------------^^^^^^^

For these strings (with the input `string contain?`) you will get:

does this string contain the input       -> false
does this string contain? the input      -> true
does this fail the input string contain? -> true
does this string contain?fail the input  -> false
string contain? the input                -> true

The idea is to match when your input is preceded by whitespace or the start of the text, and followed by whitespace or the end of the text.
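A small sketch of this whitespace-delimited pattern follows. The class name `WhitespaceDelimitedDemo` and the helper `whitespaceDelimited` are invented for the example. It also includes a case where the input is followed by more punctuation; whether that case should count as a match depends on what you need.

```java
import java.util.regex.Pattern;

class WhitespaceDelimitedDemo {
    // Hypothetical helper: true if input occurs in text delimited by
    // whitespace (or the start/end of the text) on both sides.
    static boolean whitespaceDelimited(String text, String input) {
        Pattern p = Pattern.compile("(\\s|^)" + Pattern.quote(input) + "(\\s|$)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String input = "string contain?";
        System.out.println(whitespaceDelimited("does this string contain? the input", input));      // true
        System.out.println(whitespaceDelimited("string contain? the input", input));                // true
        System.out.println(whitespaceDelimited("does this fail the input string contain?", input)); // true
        System.out.println(whitespaceDelimited("does this string contain?fail the input", input));  // false
        // A second '?' is neither whitespace nor end of text, so this is rejected too.
        System.out.println(whitespaceDelimited("does this string contain?? the input", input));     // false
    }
}
```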

Youcef LAIDANI
  • Works partly on the right-hand side, as it only matches a space after `contain?`, but not e.g. another `?`. What about the left-hand side, as in `¿does it?`? – steffen May 31 '17 at 18:27
  • @steffen you can try using `does this ?string contain? the input` it match correctly if `String input = "?string contain?";` – Youcef LAIDANI May 31 '17 at 18:31
  • OK now you've lifted the left-hand side to be as good as the partly working right-hand side ;-) Check with `does this string contain?? the input` and `string contain?`. It should match I guess, but it doesn't. – steffen May 31 '17 at 18:37
  • nope my friend, this should not match based on the 3rd example of OP `//should not match - doesn't System.out.println(p.matcher("does this string contain?fail the input").find());` you just replaced the `f` of fail with `?`, or am I wrong @steffen :) – Youcef LAIDANI May 31 '17 at 18:39
  • No, I'm afraid you're wrong. The third example should not match, because there's a following word character. It should match, if there's a non-word character coming. That's what this is all about. – steffen May 31 '17 at 18:45
  • mmm, we can understand from the OP then @steffen if i'm correct then i'm, else i will gives you my up-vote if your answer is correct, it is similar to mine, so i can't use it in my answer :) thank you any way for your information i appreciate it – Youcef LAIDANI May 31 '17 at 18:48
  • The first problem I see with this solution is that it doesn't match if the text starts with the input "string contain? the input" - I guess I didn't include that explicit case in my examples – Mr Zorn May 31 '17 at 19:48
  • nope, it works; I think you tested with the older version, I have already corrected it @MrZorn. It should be like this: `Pattern.compile("(\\s|^)" + Pattern.quote(input) + "(\\s|$)");` after steffen's comment – Youcef LAIDANI May 31 '17 at 19:49
0

You are matching using word boundaries: \b.

By default, Java's regex implementation treats the following characters as word characters: \w := [a-zA-Z_0-9]

Non-word characters are simply everything outside that group: [^\w] := [^a-zA-Z_0-9]

A word boundary is a transition from [a-zA-Z_0-9] to [^a-zA-Z_0-9] or vice versa.

For the text "does this string contain? the input" and the literal pattern \\b\\Qstring contain?\\E\\b, the last \\b would have to match at the transition from ? to <white space>. Both of these are non-word characters, so this is neither a word to non-word nor a non-word to word transition as per the definitions above, which means that it is not a word boundary and the pattern does not match.
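To make the transitions visible, here is a minimal sketch that lists every index at which `\b` matches in a short ASCII string. The class name `WordBoundaryPositions` and the helper `boundaries` are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryPositions {
    // Hypothetical helper: collects the indices in text at which the
    // zero-width assertion \b matches.
    static List<Integer> boundaries(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(text);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // In both strings '?' sits at index 7, so index 8 is the position right after it.
        System.out.println(boundaries("contain? the")); // [0, 7, 9, 12]
        System.out.println(boundaries("contain?fail")); // [0, 7, 8, 12]
    }
}
```

In the first string index 8 is missing from the list, because `?` and the space are both non-word characters. In the second string index 8 is present, because `?` is followed by the word character `f`.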

diginoise