
My application uses Spring Integration to poll email from an Outlook mailbox.

Because it receives the String (the email body) from an external system (Outlook), I have no control over its content.

For example:

String emailBodyStr= "rejected by sundar14-\u200B.";

I am trying to remove the Unicode character \u200B (zero width space) from this String.

What I have tried already:

Try#1:

emailBodyStr = emailBodyStr.replaceAll("\u200B", "");

Try#2:

emailBodyStr = emailBodyStr.replaceAll("\u200B", "").trim();

Try#3 (using Apache Commons):

StringEscapeUtils.unescapeJava(emailBodyStr);

Try#4:

StringEscapeUtils.unescapeJava(emailBodyStr).trim();

Nothing has worked so far.

When I print the String using the code below:

logger.info("Comment BEFORE:{}",emailBodyStr);
logger.info("Comment AFTER :{}",emailBodyStr);

the Eclipse console does NOT show the Unicode character:

Comment BEFORE:rejected by sundar14-​.

But the same code prints the Unicode character in the Linux console, as below:

Comment BEFORE:rejected by sundar14-\u200B.
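The two consoles alone do not show whether the String holds the real U+200B character or the six literal characters backslash, u, 2, 0, 0, B. A minimal sketch for checking this is to print every code point of the String. The class name `CodePointDump` and the method `describe` are hypothetical names chosen for illustration, and the two sample values are assumptions, not the actual email body.

```java
public class CodePointDump {

    // Returns each code point of the input as U+XXXX, separated by spaces.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Three characters: hyphen, the real zero width space, period.
        String real = "-\u200B.";
        // Eight characters: the escape sequence is stored as plain text.
        String literal = "-\\u200B.";

        System.out.println(describe(real));    // U+002D U+200B U+002E
        System.out.println(describe(literal)); // U+002D U+005C U+0075 U+0032 U+0030 U+0030 U+0042 U+002E
    }
}
```

If the output contains U+005C followed by U+0075, the String carries the escape sequence as text, and a replacement that targets the single character U+200B will not match it.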

I have read some examples where str.replace() is recommended, but note that those examples use JavaScript and PHP, not Java.

Sundararaj Govindasamy
  • The `replaceAll` approach works when I try it. – resueman Mar 22 '17 at 18:53
  • how you tested it? able to print? – Sundararaj Govindasamy Mar 22 '17 at 19:01
  • Before trying to replace it, I see an unprintable character in the output (shows up as "?"), and the length is 23. After replacing it, the unprintable character is gone and the length is 22. – resueman Mar 22 '17 at 19:04
  • Where are you seeing "?", in your IDE console? Which IDE you are using? When I used str.replaceAll(), I was also getting count 23 (before) and 22 (after), but When I stored this string (after str.replaceAll()) in database - I can see '\u200B' in DB. – Sundararaj Govindasamy Mar 22 '17 at 19:11
  • I see the question mark when running in Windows command prompt. Also, comparing the `String` with another one created without the \u200B shows that they're equal. – resueman Mar 22 '17 at 19:16
  • Simple replace works - http://ideone.com/MPGzqA – Wiktor Stribiżew Mar 22 '17 at 19:36
  • Big props figuring out your own answer! Fyi though, I believe the reason you originally had this problem is that you used `replaceAll()` instead of `replace()`. `replaceAll()` treats its first argument as a regex string, and since your input of `\u200B` has a \, it parses incorrectly and doesn't replace as it can't find the search string. – rococo Jun 28 '21 at 18:41
  • Excellent job writing this question! Very thorough. – devdanke May 05 '23 at 19:14

1 Answer

Finally, I was able to remove the 'Zero Width Space' character by using a Unicode regex.

// \p{Cf} matches every character in the Unicode "Format" category, which includes U+200B
String plainEmailBody = emailBodyStr.replaceAll("\\p{Cf}", "");
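As a self-contained sketch of the same idea, the snippet below precompiles the `\p{Cf}` pattern and applies it to the sample String from the question. The class name `ZeroWidthCleaner` and the method `stripFormatChars` are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

public class ZeroWidthCleaner {

    // Matches any character in Unicode general category Cf (Format).
    private static final Pattern FORMAT_CHARS = Pattern.compile("\\p{Cf}");

    // Returns the input with all Format-category characters removed.
    public static String stripFormatChars(String input) {
        return FORMAT_CHARS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String emailBodyStr = "rejected by sundar14-\u200B.";
        String plainEmailBody = stripFormatChars(emailBodyStr);

        // The zero width space is invisible, so compare lengths instead.
        System.out.println(emailBodyStr.length() + " -> " + plainEmailBody.length()); // 23 -> 22
    }
}
```

Note that `\p{Cf}` is broader than U+200B alone: it also removes other format characters such as the soft hyphen (U+00AD), the zero width joiner (U+200D), and the byte order mark (U+FEFF). If only the zero width space should go, a narrower pattern may be the safer choice.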

References for finding the category of a Unicode character:

  1. The Character class from Java, which lists all of these Unicode categories.

  2. Website: http://www.fileformat.info/ (character category)

  3. Website: http://www.regular-expressions.info/ => Unicode Regular Expressions (Unicode regex for the \u200B character)
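The category lookup can also be done in code. The sketch below uses `Character.getType` and the `Character.FORMAT` constant from the Java standard library to check whether a code point belongs to the Cf category; the class name `UnicodeCategoryCheck` and the method `isFormatChar` are hypothetical names for illustration.

```java
public class UnicodeCategoryCheck {

    // True when the code point is in Unicode general category Cf (Format).
    public static boolean isFormatChar(int codePoint) {
        return Character.getType(codePoint) == Character.FORMAT;
    }

    public static void main(String[] args) {
        System.out.println(isFormatChar(0x200B)); // true: zero width space
        System.out.println(isFormatChar(' '));    // false: ordinary space is a space separator
    }
}
```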

Note 1: Because I received this String from an Outlook email body, none of the approaches listed in my question worked. My application receives the String from an external system (Outlook), so I have no control over it.

Note 2: This SO answer helped me learn about Unicode regular expressions.

Sundararaj Govindasamy