
I am trying to delete/replace whole words in a string.

The matching should be case-insensitive, and it should also work when the search string contains special characters such as ., \ or /.

To do so, I use the following code:

String result = Pattern.compile(stringToReplace, Pattern.LITERAL | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE).matcher(inputString)
                    .replaceAll("");

Like this, it works for special characters and it is case insensitive.

I know that I can enable whole word matching by using "\b".

I could do the following:

String result = Pattern.compile("\\b"+stringToReplace+"\\b",  Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE).matcher(inputString)
                    .replaceAll("");

This way it matches only whole words, but special characters in stringToReplace are now interpreted as regex syntax. The reason is that I had to drop Pattern.LITERAL (with that flag the \b would be matched literally as well), which is not what I want.
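To make the problem concrete, here is a minimal sketch of what happens when the search string is concatenated unescaped. The class and method names (`UnquotedBoundaryDemo`, `removeUnquoted`) and the sample strings are made up for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
class UnquotedBoundaryDemo {

    // Same construction as above: the search string is inserted into the regex unescaped.
    static String removeUnquoted(String input, String word) {
        return Pattern.compile("\\b" + word + "\\b",
                        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                .matcher(input)
                .replaceAll("");
    }

    public static void main(String[] args) {
        // "a.c" is read as a regex, so the dot matches any character
        // and "abc" is removed together with "a.c".
        System.out.println("[" + removeUnquoted("abc a.c", "a.c") + "]"); // prints "[ ]"
    }
}
```

Only the literal `a.c` was supposed to be removed here, yet `abc` disappears as well.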

How can I combine Pattern.LITERAL with whole word matching?

kerner1000

1 Answer


You must remember that the \b word boundary pattern is context-dependent: it matches between the start/end of the string and a word char, or between a word char and a non-word char.
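As a small illustration of that context dependence, the following sketch lists every position at which `\b` matches in a sample string. The class name `WordBoundaryPositions` and the helper `boundaries` are hypothetical and only serve this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration.
class WordBoundaryPositions {

    // Collects the indices at which the zero-width \b assertion matches.
    static List<Integer> boundaries(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(text);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // There is no boundary at index 3, between '.' and ' ',
        // because both neighbours are non-word chars.
        System.out.println(boundaries("Dr. Alban")); // prints [0, 2, 4, 9]
    }
}
```

This is why a pattern that ends in a non-word char followed by `\b` fails when the next char in the input is also a non-word char.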

You need to use

String result = Pattern.compile("(?!\\B\\w)"+Pattern.quote(stringToReplace)+"(?<!\\w\\B)",  Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE).matcher(inputString)
                    .replaceAll("");

There are two main changes:

  • The stringToReplace needs to be wrapped in Pattern.quote() so that all special characters in it are treated literally
  • Adaptive word boundaries will make sure the word boundary is only required when necessary, i.e. when the neighbouring chars are word chars. (?!\B\w) is a left-hand adaptive word boundary and the (?<!\w\B) is a right-hand adaptive word boundary. Actually, it appears that both can be used interchangeably due to the nature of the zero-width assertions and the word boundary pattern, but this notation is best from the logical point of view.
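A minimal sketch comparing the two approaches, assuming the pattern shown above. The class and method names (`AdaptiveBoundaryDemo`, `removeWholeWord`, `removeWithPlainBoundaries`) and the sample inputs are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration.
class AdaptiveBoundaryDemo {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    // Adaptive word boundaries: a boundary is only required on a side
    // where the search string itself starts or ends with a word char.
    static String removeWholeWord(String input, String word) {
        String regex = "(?!\\B\\w)" + Pattern.quote(word) + "(?<!\\w\\B)";
        return Pattern.compile(regex, FLAGS).matcher(input).replaceAll("");
    }

    // Plain \b on both sides, for comparison.
    static String removeWithPlainBoundaries(String input, String word) {
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return Pattern.compile(regex, FLAGS).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Search string ends with a non-word char: no right-hand boundary is demanded.
        System.out.println("[" + removeWholeWord("Dr. Alban", "dr.") + "]");           // prints "[ Alban]"
        // Plain \b finds nothing: there is no boundary between '.' and ' '.
        System.out.println("[" + removeWithPlainBoundaries("Dr. Alban", "dr.") + "]"); // prints "[Dr. Alban]"
        // Ordinary words are still matched as whole words only.
        System.out.println("[" + removeWholeWord("cat concat Cat.", "cat") + "]");     // prints "[ concat .]"
    }
}
```

Note that both variants remove only the matched text, so surrounding whitespace stays in the result.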
Wiktor Stribiżew
  • I’ve read this answer multiple times and I still don’t get, which problem using `"(?!\\B\\w)"` and `"(?<!\\w\\B)"` is supposed to solve, compared to a simple `"\\b" + Pattern.quote(stringToReplace) + "\\b"` – Holger Jan 19 '23 at 13:49
  • @Holger A very common issue, where one tries to match `Dr.` as a whole word in a `Dr. Alban` string with a `\bDr\.\b` pattern and can't figure out why it does not match. The [link](https://stackoverflow.com/q/45145626/3832970) in the answer leads to one of such issue discussions. – Wiktor Stribiżew Jan 19 '23 at 14:00
  • I see. I don’t know whether this has relevance to the OP’s problem or whether this still counts as “matching whole words”. Maybe, it would have been better to start with the problem the OP described (which is solved by using `Pattern.quote`) and then describe the potential problem with just using `\b` – Holger Jan 19 '23 at 14:03
  • @Holger `Pattern.quote` would not solve the issues since the problem as stated is broad and sounds as the "problems for special characters". Since special character positions are not clarified, the only solution is to take into account all edge cases that include special chars at the start or end of the search strings. When one wants to use word boundaries and says that the search strings can contain special chars, this is the only way: adjust the word boundary pattern + using `Pattern.quote`. – Wiktor Stribiżew Jan 19 '23 at 14:10
  • As you said, the position of the special characters is not given, so if you are presuming that these characters are at the beginning or end of the pattern, you should include it in your answer. As I said in my first comment, I did not understand which problem you are going to solve, given the question and the answer as-is. – Holger Jan 19 '23 at 14:18
  • @Holger I just provided a final solution, so that the OP did not have to ask another (duplicate) question like "why my `Dr.` whole word is not matched, look, I am using word boundaries". – Wiktor Stribiżew Jan 19 '23 at 14:37
  • This regex does not work for "\_". How do I need to change it to capture "\_" as well? – kerner1000 Feb 06 '23 at 08:29
  • @kerner1000 If you mean you want to treat the `_` char as a non-word char (so that it is counted as a word boundary) you need to subtract it from `\w`. I am not sure how to do this with adaptive word boundaries yet, but `"(?<![^\\W_])"+Pattern.quote(stringToReplace)+"(?![^\\W_])"` can work for you. – Wiktor Stribiżew Feb 06 '23 at 08:45
  • @WiktorStribiżew that does not seem to work. Am I not missing the \\B here? – kerner1000 Feb 06 '23 at 15:29
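For reference, here is a minimal sketch of the lookaround variant suggested in the comments, set against the adaptive-boundary pattern from the answer. It assumes the interpretation given in that comment, namely that `_` should count as a non-word char; whether this matches the case that failed for the OP is not clear from the thread. The class and method names (`UnderscoreBoundaryDemo`, `removeUnderscoreAsSeparator`, `removeAdaptive`) and the sample input are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration.
class UnderscoreBoundaryDemo {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    // Variant from the comments: [^\W_] is "a word char other than underscore",
    // so the match must not be directly preceded or followed by a letter or digit.
    static String removeUnderscoreAsSeparator(String input, String word) {
        String regex = "(?<![^\\W_])" + Pattern.quote(word) + "(?![^\\W_])";
        return Pattern.compile(regex, FLAGS).matcher(input).replaceAll("");
    }

    // Adaptive word boundaries from the answer, where '_' is still a word char.
    static String removeAdaptive(String input, String word) {
        String regex = "(?!\\B\\w)" + Pattern.quote(word) + "(?<!\\w\\B)";
        return Pattern.compile(regex, FLAGS).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "_foo_ foobar FOO";
        // "_foo_" is matched because '_' no longer blocks the match.
        System.out.println("[" + removeUnderscoreAsSeparator(input, "foo") + "]"); // prints "[__ foobar ]"
        // With \w-based boundaries, "_foo_" is left alone.
        System.out.println("[" + removeAdaptive(input, "foo") + "]");              // prints "[_foo_ foobar ]"
    }
}
```

Unlike the adaptive pattern, the lookaround variant applies its neighbour check on both sides regardless of how the search string starts or ends, so a search string such as `Dr.` directly followed by a letter would not match.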