1

I have a string that contains multiple Unicode characters. I want to identify each of them, e.g. \uF06C, and replace it with a backslash followed by the four hex digits, without the "u".

Example:

Source String: "add \uF06Cd1 Clause"

Result String: "add \F06Cd1 Clause"

How can I achieve this in Java?

Edit:

The linked question, "Java Regex - How to replace a pattern or how to", is different from this one because my question deals with Unicode characters. Although the escape is written with several characters in the literal, the JVM treats it as one single character, and hence a regex on the escape text won't work.

Maz
  • Possible duplicate of [Java Regex - How to replace a pattern or how to](http://stackoverflow.com/questions/9285231/java-regex-how-to-replace-a-pattern-or-how-to) –  Jan 15 '17 at 23:56

2 Answers

2

The correct way to do this is to use a regex that matches the entire Unicode escape and then apply group replacement.

The regex to match the Unicode escape:

A Unicode escape looks like \uABCD: \u followed by four hexadecimal digits. These can be matched using

\\u[A-Fa-f\d]{4}

But there's a problem with this:
In text like "just some \\uabcd arbitrary text", where the backslash itself is escaped, the \u would still get matched. So we need to make sure the \u is preceded by an even number of backslashes (possibly zero):

(?<!\\)(\\\\)*\\u[A-Fa-f\d]{4}
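The effect of the lookbehind can be checked with a small sketch. The class and method names (`EscapeLookbehindDemo`, `containsEscape`) are hypothetical and only serve the illustration; the regex is the one shown above, with every backslash doubled for the Java string literal.

```java
import java.util.regex.Pattern;

final class EscapeLookbehindDemo {
    // Regex: (?<!\\)(\\\\)*\\u[A-Fa-f\d]{4}, each backslash doubled for the Java literal.
    static final Pattern UNICODE_ESCAPE =
            Pattern.compile("(?<!\\\\)(\\\\\\\\)*\\\\u[A-Fa-f\\d]{4}");

    // Hypothetical helper: true if the text contains an unescaped Unicode escape.
    static boolean containsEscape(String text) {
        return UNICODE_ESCAPE.matcher(text).find();
    }

    public static void main(String[] args) {
        // At runtime this text holds ONE backslash before "uabcd": a real escape.
        System.out.println(containsEscape("some \\uabcd text"));   // true
        // At runtime this text holds TWO backslashes before "uabcd": an escaped backslash.
        System.out.println(containsEscape("some \\\\uabcd text")); // false
    }
}
```

An odd number of backslashes directly before the `u` counts as an escape, an even number does not.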

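Now as an output, we want a backslash followed by the hex-digit part. This can be done with group replacement, so let's start by grouping the parts of the pattern. The leading backslash pairs go into one group that spans the whole run; a repeated group such as (\\\\)* would only capture its last repetition, so longer runs of backslashes would be shortened by the replacement:

(?<!\\)((?:\\\\)*)(\\u)([A-Fa-f\d]{4})

As a replacement we want all backslashes captured by the first group, followed by a single backslash and the hex-digit part of the Unicode escape:

$1\\$3

Now for the actual code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Each regex backslash is doubled once more for the Java string literal.
String pattern = "(?<!\\\\)((?:\\\\\\\\)*)(\\\\u)([A-Fa-f\\d]{4})";
String replace = "$1\\\\$3";

// test is the text that contains the escape sequences as plain characters
Matcher match = Pattern.compile(pattern).matcher(test);
String result = match.replaceAll(replace);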

That's a lot of backslashes! The reason is that a backslash has to be escaped twice: once for the Java string literal and once for the regex. So "\\\\" as a pattern string in Java matches a single \ in the input.
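As a self-contained sketch of the regex approach, assuming the input really contains the six characters of the escape (backslash, u, four hex digits) rather than the decoded character. The class and method names (`EscapeTextRewriter`, `dropU`) are hypothetical. The optional run of escaped backslashes is captured as one group, ((?:\\\\)*), so that $1 reproduces the whole run.

```java
import java.util.regex.Pattern;

final class EscapeTextRewriter {
    // Group 1: run of escaped backslashes (may be empty). Group 2: the four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("(?<!\\\\)((?:\\\\\\\\)*)\\\\u([A-Fa-f\\d]{4})");

    // Hypothetical helper: rewrites each escape to a backslash plus its hex digits.
    static String dropU(String text) {
        // Replacement at regex level is $1\\$2: kept backslash pairs, one backslash, hex digits.
        return UNICODE_ESCAPE.matcher(text).replaceAll("$1\\\\$2");
    }

    public static void main(String[] args) {
        // The escape is present as plain text, so it is rewritten.
        System.out.println(dropU("add \\uF06Cd1 Clause")); // add \F06Cd1 Clause
        // An escaped backslash followed by "uF06C" is left alone.
        System.out.println(dropU("keep \\\\uF06C"));       // keep \\uF06C
        // Here the compiler already decoded the escape into one char, so nothing matches.
        String decoded = "add \uF06Cd1 Clause";
        System.out.println(dropU(decoded).equals(decoded)); // true
    }
}
```

The last call shows why this approach returns the input unchanged when it is run on a string literal such as `"add \uF06Cd1 Clause"`: the text no longer contains a backslash or a `u`.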

EDIT:
On actual String objects, where the compiler has already turned each escape into a single character, the characters need to be picked out and replaced by their hexadecimal code:

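// in is the input String; every char above 127 becomes a backslash plus hex digits
StringBuilder sb = new StringBuilder();
for (char c : in.toCharArray()) {
    if (c > 127) {
        // %04x prints lowercase hex; use %04X for uppercase as in the question
        sb.append("\\").append(String.format("%04x", (int) c));
    } else {
        sb.append(c);
    }
}
String result = sb.toString();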

This assumes that by "unicode-character" you mean non-ASCII characters. The code outputs any ASCII character as is and every other character as a backslash followed by its four-digit hexadecimal code. The term "unicode-character" is rather vague, though, since a char in Java always represents a Unicode (UTF-16) code unit. This approach preserves control characters such as "\n" and "\r", which is why I chose it over other definitions.
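A minimal runnable sketch of the same loop, wrapped in a hypothetical class `NonAsciiEscaper` with a hypothetical method `escapeNonAscii`. It assumes uppercase hex digits are wanted, as in the question's expected result, and therefore uses `%04X`. Because the loop works on UTF-16 code units, a character outside the Basic Multilingual Plane comes out as two escaped surrogates.

```java
final class NonAsciiEscaper {
    // Hypothetical helper: ASCII chars are copied, all others become backslash + 4 hex digits.
    static String escapeNonAscii(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c > 127) {
                // %04X: uppercase hex, zero-padded to four digits
                sb.append('\\').append(String.format("%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "add \uF06Cd1 Clause";
        // The escape in the literal is a single char at runtime.
        System.out.println(s.length());        // 14
        System.out.println(escapeNonAscii(s)); // add \F06Cd1 Clause
    }
}
```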

duffy356
  • tried this as well. It gives me the result same as source. – Maz Jan 16 '17 at 01:03
  • @Maz did you run it on the source-code, or the string-literal? If you run it directly on the string you'll have to resort to another approach. This answer is supposed to filter the source-code, not the actual string. –  Jan 16 '17 at 01:06
  • Didn't quite understand what is source-code. I have created a stand alone class to test this. I define a string literal that has **add d1 Clause**, like `String s = "add \uF06Cd1 Clause";`, and then use 4 lines of the code from your comment. The result is same as string literal. – Maz Jan 16 '17 at 01:09
  • @Maz In the source-code, a string would look like this "\u202Eabc". The actual `String` would be "cba" (\u202E is text-reversal). –  Jan 16 '17 at 01:11
  • @Maz alright. You want to directly match unicode-characters as they are in the string and replace those. That's doable within certain limitations. I'll update my answer accordingly –  Jan 16 '17 at 01:12
  • Thanks Paul. Appreciate your help. – Maz Jan 16 '17 at 01:13
  • @Maz I've edited my answer with code that works on `String`-objects. –  Jan 16 '17 at 01:25
  • Thanks Paul. Coincidentally, I too ended up with similar code. I have used `if(CharUtils.isAscii(c))` instead of your if condition. It worked fine. Hope there are no risks here. – Maz Jan 16 '17 at 03:23
  • And another change in my code is I have used `String.format("\\%04x", (int) c);` that gives me the result directly. – Maz Jan 16 '17 at 03:36
  • Instead of creating a StringBuilder and using `String.format` in your loop, which creates a new java.util.Formatter object each time, use `new Formatter()` (which implicitly wraps a StringBuilder) and use Formatter’s [format](https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html#format-java.lang.String-java.lang.Object...-) method. When finished, formatter.toString() yields the same result as StringBuilder.toString(). – VGR Jan 16 '17 at 03:53
  • @VGR, how will you use the formatter in else condition for above answer? – Maz Jan 16 '17 at 04:15
  • @Maz `if (c > 127) formatter.format("\\%04x", (int) c); else formatter.format("%c", c);` … or even the more compact `formatter.format(c > 127 ? "\\%04x" : "%c", (int) c);`. – VGR Jan 16 '17 at 04:17
  • @Maz `CharUtils.isAscii(c)` is actually just a more concise way of doing precisely what my code does. Internally it just does `c < 128`. –  Jan 16 '17 at 11:39
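
The `Formatter` variant suggested in the comments could look roughly like the sketch below. The class and method names (`FormatterEscaper`, `escapeNonAscii`) are hypothetical; the format strings are the ones from the comment, so the hex digits come out in lowercase.

```java
import java.util.Formatter;

final class FormatterEscaper {
    // Hypothetical helper: one Formatter is reused for the whole string.
    static String escapeNonAscii(String in) {
        // The no-argument constructor writes into an internal StringBuilder.
        Formatter formatter = new Formatter();
        for (char c : in.toCharArray()) {
            if (c > 127) {
                formatter.format("\\%04x", (int) c); // backslash + lowercase hex
            } else {
                formatter.format("%c", c);           // ASCII char copied as is
            }
        }
        return formatter.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("add \uF06Cd1 Clause")); // add \f06cd1 Clause
    }
}
```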
-4

Try using String.replaceAll() method

s = s.replaceAll("\u", "\");

  • Well, that'll work most of the time. But how about some string like "...\\u....". That's not a unicode-character, yet your code will happily override it. This is definitely not safe to use, as it will break sooner or later. –  Jan 16 '17 at 00:00
  • This one gives compilation errors. On escaping \, it doesn't give the desired result. – Maz Jan 16 '17 at 01:01