I am trying to delete all the occurrences of a word in a list, but I am having trouble when there are apostrophes in the words.

String phrase="bob has a bike and bob's bike is red";
String word="bob";
phrase=phrase.replaceAll("\\b"+word+"\\b","");
System.out.println(phrase);

output:
has a bike and 's bike is red

What I want is
has a bike and bob's bike is red

I have a limited understanding of regex, so I'm guessing there is a solution, but I do not know enough to create a regex that handles apostrophes. I would also like it to work with dashes, so that in the phrase "the new mail is e-mail" only the first occurrence of mail is replaced.

qw3n
  • I wouldn’t use `\b` in Java patterns: it’s superduper broken. There is a way to express it correctly, but this is unlikely to be what you want anyway. – tchrist Jan 22 '11 at 19:16
  • Here’s my torture-test string for pulling out individual words: *James asked, “’Tis Renée’s and Noël’s great‐grandparents’ 1970's-ish summer‐house, t'isn’t it?”  Receiving no answer, he shook his head--and walked away.* As a Java string you could write that `"James asked, \u201C\u2019Tis Ren\u00E9e\u2019s and Noe\u0308l\u2019s great\u2010grandparents\u2019 1970's-ish summer\u2010house, t'isn\u2019t it?\u201D \u00A0 Receiving no answer, he shook his head--and walked away."`. Good luck! – tchrist Jan 22 '11 at 19:21
  • @tchrist I'm shaking my head and walking away lol. For what I am doing I do not need to worry about all of those eventualities. But it does look like an interesting challenge. – qw3n Jan 22 '11 at 19:33

2 Answers

It all depends on what you understand to be a "word". Perhaps you'd better define what you consider a word delimiter: for example, blanks, commas, and so on. Then write something like

phrase=phrase.replaceAll("([ \\s,.;])" + Pattern.quote(word)+ "([ \\s,.;])","$1$2");

But you'll have to check additionally for occurrences at the start and the end of the string. For example:

  String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
  String word="bob";
  phrase=phrase.replaceAll("([\\s,.;])" + Pattern.quote(word) + "([\\s,.;])","$1$2");
  System.out.println(phrase);

prints this

bob has a bike ,  and boba bob's bike is red and "bob" stuff.
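One possible way to cover the start and end of the string is sketched below. It is not part of the snippet above: the class and method names are made up for illustration, and the delimiter set is simply the one used above. The leading group also accepts `^`, and the trailing delimiter becomes a lookahead that also accepts `$`, so a delimiter shared by two adjacent occurrences is not consumed by the first match.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the answer above.
class DelimiterWordRemover {

    // Removes `word` when it is surrounded by the listed delimiters
    // or sits at the very start or end of the phrase.
    static String remove(String phrase, String word) {
        String delim = "[\\s,.;]";
        String regex = "(^|" + delim + ")"        // start of string or a delimiter (kept via $1)
                + Pattern.quote(word)             // the literal word
                + "(?=" + delim + "|$)";          // followed by a delimiter or end of string (not consumed)
        return phrase.replaceAll(regex, "$1");
    }

    public static void main(String[] args) {
        String phrase = "bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
        System.out.println(remove(phrase, "bob"));
    }
}
```

With this variant the leading `bob` is removed as well, while `boba`, `bob's` and `"bob"` are left alone because `a`, `'` and `"` are not in the delimiter set.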

Update: If you insist on using \b, given that Java's "word boundary" understands Unicode, you can also use this dirty trick: replace all occurrences of ' with some Unicode letter that you are sure will not appear in your text, and afterwards do the reverse replacement. Example:

  String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
  String word="bob";
  phrase= phrase.replace("'","ñ").replace('"','ö');
  phrase=phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b","");
  phrase= phrase.replace('ö','"').replace("ñ","'");
  System.out.println(phrase);
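If swapping characters in and out feels too fragile, a different technique is to spell out the boundary yourself with lookarounds. The sketch below assumes that letters, digits, apostrophes and hyphens all count as part of a word; the class name and that character set are illustrative choices, not something defined in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper: treats ' and - as word characters.
class ApostropheAwareRemover {

    // Characters assumed to belong to a word: letters, decimal digits, ' and -
    private static final String WORD_CHARS = "[\\p{L}\\p{Nd}'\\-]";

    static String remove(String phrase, String word) {
        String regex = "(?<!" + WORD_CHARS + ")"   // not preceded by a word character
                + Pattern.quote(word)              // the literal word
                + "(?!" + WORD_CHARS + ")";        // not followed by a word character
        return phrase.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(remove("bob has a bike and bob's bike is red", "bob"));
        System.out.println(remove("the new mail is e-mail", "mail"));
    }
}
```

Because the lookarounds also succeed at the start and end of the input, no separate handling of string edges is needed. Typographic apostrophes or dashes would have to be added to the set if they can occur in the text.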

UPDATE: To summarize some comments below: one would expect \w and \b to share the same notion of what a "word character" is, as in almost every other regular-expression dialect. Well, Java does not: \w considers ASCII, \b considers Unicode. It's an ugly inconsistency, I agree.

Update 2: Since Java 7 (as pointed out in the comments), the UNICODE_CHARACTER_CLASS flag lets you specify consistent, Unicode-aware behaviour for both \w and \b; see the documentation of java.util.regex.Pattern.
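A small sketch of the flag in use, assuming Java 7 or later. The class and method names are made up for the example. It only demonstrates the flagged pattern, since the behaviour without the flag is exactly what the comments below argue about.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answer.
class UnicodeWordDemo {

    // Returns the first match of \b\w+\b with Unicode-aware \w and \b, or null if none.
    static String firstWord(String text) {
        Pattern p = Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "élève", written with escapes to avoid source-encoding issues
        System.out.println(firstWord("\u00E9l\u00E8ve"));
    }
}
```

The same flag can be switched on inside the pattern itself with the embedded expression `(?U)`, which is handy with `String.replaceAll`, where there is no place to pass flags.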

leonbloy
  • @leonbloy Earlier in the program I already stripped out all of the punctuation except `'` and `-`. I liked the `\b` because it worked at the beginning and end of Strings where there would be no spaces. – qw3n Jan 22 '11 at 19:01
  • Well, '\b' is certainly nice, but I don't think Java allows you to redefine the "word" class. Another dirty trick is to replace your `'` with some word you are sure will not appear in the phrase, apply the original regex, and do the inverse replacement. Very dirty, but sometimes practical. – leonbloy Jan 22 '11 at 19:11
  • I had thought briefly about substituting something for the `'` and `-`, and even though it is very hackish, I think it is the quickest, easiest solution. Thanks for your help. – qw3n Jan 22 '11 at 19:20
  • `\b` is **supposed** to be the same as `(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))` — but thoughtlessly Java broke that. Similarly, `\w` is **supposed** to be `[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]`. Substituting one into the other so that Java can understand it yields that `\b` is actually **supposed** to be `(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))`. – tchrist Jan 22 '11 at 19:33
  • @tchrist: Can you point to the docs that state that `\b` is not what it should be? Websites I found (http://answers.oreilly.com/topic/217-how-to-match-whole-words-with-a-regular-expression/, http://www.regular-expressions.info/wordboundaries.html) suggest that at least `\b` works the way it should. – Tomalak Jan 22 '11 at 19:36
  • @qw3n: The Unicode Dash property is equivalent to `[\u002D\u058A\u05BE\u1400\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]`. And for the apostrophe, you probably want to use at least `[\u0027\u02BC\u2019\uFF07]`. – tchrist Jan 22 '11 at 19:39
  • @Tomalak: Since its creation, `\b` has always been defined to be a `\w` transition. Larry Wall created it for Perl 0, long ago and far away. That is what it has always meant, and that’s why it is exactly identical to `(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))`. – tchrist Jan 22 '11 at 19:42
  • @Tomalak: The simple existence proof that Java has managed to completely screw up `\b` beyond redemption is that the pattern `\b\w+\b` fails to match the string `"élève"` **at any point whatsoever!** That violates a fundamental axiom of what `\b` means. – tchrist Jan 22 '11 at 19:45
  • @tchrist: While I appreciate elaborative comments, they break the layout, as you see. It could be that setting up a test case at ideone.com and linking to it is more practical. – Tomalak Jan 22 '11 at 20:58
  • @Tomalak: Java left `\w` as stuck in ancient ASCII alone, which is contrary to [UTS#18](http://www.unicode.org/reports/tr18/#Compatibility_Properties), and also changed the meaning of `\b` so that it no longer relates to `\w`. Java ignores the Unicode standard in this, and it also ignores every other programming language I have ever seen, which all guarantee that `"élève"` has at least one possible match (and perhaps several) for `\b\w+\b`. Nothing else behaves the way Java does in this regard. – tchrist Jan 22 '11 at 21:39
  • Do you mind updating this post with `U` flag ([UNICODE_CHARACTER_CLASS](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS))? Java has finally done it right from Java 7. – nhahtdh Feb 10 '14 at 08:28
\b\S*(bob|mail)\S*\b

Be careful with false positives: this could match more than you want. If you need "prefixes" or "suffixes" of no more than 2 characters (things like "'s" or "e-"), use \S{0,2} instead of \S*.

The regex says:

\b           # a word boundary
\S*          # any number of non-spaces
(            # match group 1 (to enable a choice) 
  bob|mail   #   "bob" or "mail"
)            # end match group 1
\S*          # any number of non-spaces
\b           # a word boundary

So, in Java:

phrase = phrase.replaceAll("\\b\\S*(bob|mail)\\S*\\b", "");

Be careful with things like

phrase = phrase.replaceAll("\\b" + word + "\\b", "");

That should be

phrase = phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b", "");

since whenever word contains regex meta characters, your regex will break unless you properly escape the string beforehand using Pattern.quote().
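To make the difference concrete, here is a minimal sketch. The word `a.b` and the class name are made up for the example. Unquoted, the dot is a regex wildcard, so the pattern also removes `axb`; quoted, only the literal `a.b` is removed.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing quoted and unquoted concatenation.
class QuoteDemo {

    // Builds the regex by plain concatenation: metacharacters in `word` stay active.
    static String removeUnquoted(String phrase, String word) {
        return phrase.replaceAll("\\b" + word + "\\b", "");
    }

    // Builds the regex with Pattern.quote(): `word` is matched literally.
    static String removeQuoted(String phrase, String word) {
        return phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b", "");
    }

    public static void main(String[] args) {
        String phrase = "axb a.b";
        System.out.println("[" + removeUnquoted(phrase, "a.b") + "]");
        System.out.println("[" + removeQuoted(phrase, "a.b") + "]");
    }
}
```

Other metacharacters fail more loudly: an unbalanced `(` in the word, for instance, makes the unquoted version throw a `PatternSyntaxException`.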

Tomalak
  • No, this is misleading. You mustn’t mix word boundaries with things like non-ASCIISpace. That’s because the pattern `\b\S+bob` **fails to match** strings like `"==bob"`. Also, `\b` is utterly broken in Java, since the string `"élève"` fails to be matched by the pattern `\b\w+\b`, both in its entirety via `Pattern.matches()` and indeed at any point within that string using `Pattern.find()`! – tchrist Jan 22 '11 at 19:15
  • @tchrist: Things like "==bob" were not part of the question specification, so I did not go through the trouble of anticipating them. Good point about `\b` being broken, though. However, I'll leave the answer unchanged for now, as it works with the cases the OP gave. Unless the question becomes more specific, I'll have nothing to work with and can't really improve my answer. – Tomalak Jan 22 '11 at 19:21
  • @Tomalak: Agreed. One trouble with so many of these questions is that they don’t actually spec out the problem completely enough for a proper solution. Another is when they think their approach to a solution is necessarily the way to go about it in the first place. – tchrist Jan 22 '11 at 19:26
  • @tchrist: Some sources suggest the `\b` is in fact Unicode-enabled in Java: http://answers.oreilly.com/topic/217-how-to-match-whole-words-with-a-regular-expression/ and things would therefore work. – Tomalak Jan 22 '11 at 19:26
  • @Tomalak I mentioned on the other post that I stripped all non-alphabet characters out except for `'` and `-`. I should have mentioned that in the question. I halfway understand your answer, but it seems like it matches anything with bob in it and then replaces the whole word. I want it to only replace bob if it is bob and not bob's or hi-bob. – qw3n Jan 22 '11 at 19:26
  • @qw3n: Read through the statement I made in my answer about length-limiting the expression. – Tomalak Jan 22 '11 at 19:29
  • @Tomalak: It is true that `\b` has Unicode sensitivity in Java, but that does not mean that it is not broken. Both happen to be true. – tchrist Jan 22 '11 at 19:48
  • @tchrist: Hm, good to know. I'll keep that in the back of my mind. – Tomalak Jan 22 '11 at 21:02