Inspired by this regex-based answer to a Unicode question, I now have the following JavaScript code, which unescapes Unicode escape sequences in strings.

var testString = "\\u53ef\\u4ee5NonUnicode\\u544a\\u8bc9\\u6211";
print(testString)

String.prototype.unescape = function() {
        return this.replace(/\\u([0-9a-f]{4})/g, 
                function (whole, group1) {
                    return String.fromCharCode(parseInt(group1, 16));
                }
            );
    };

print(testString.unescape()) // outputs: 可以NonUnicode告诉我

I could not find a way to do this kind of dynamic regex replacement in Java (1.7). There seem to be only static approaches such as java.lang.String.replaceAll, or java.util.regex.Matcher.group, which returns the group but offers no way to set it.

Is this even possible in Java? Are there any workarounds?

mike
  • I guess the problem is the same as in [Howto unescape a Java string literal in Java](http://stackoverflow.com/questions/3537706/howto-unescape-a-java-string-literal-in-java)? – Wiktor Stribiżew Sep 14 '15 at 10:26

1 Answer

It is pretty simple using Matcher.appendReplacement() and Matcher.appendTail():

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compile the pattern
Pattern p = Pattern.compile("\\\\u([0-9a-f]{4})");
// Create a matcher for our input
Matcher m = p.matcher(testString);
// Create a buffer to hold the resulting string
StringBuffer result = new StringBuffer();
// Iterate over matches
while(m.find()) {
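  // Decode the four hex digits into a UTF-16 code unit
  int codePoint = Integer.parseInt(m.group(1), 16);
  char[] chars = Character.toChars(codePoint);
  // Append to buffer; quoteReplacement keeps a decoded '$' or '\' literal
  m.appendReplacement(result, Matcher.quoteReplacement(new String(chars)));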
}
// Append rest of string
m.appendTail(result);
// Display result
System.out.println(result);

You can test it here.
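
For reference, here is the same loop as a minimal, self-contained sketch that can be compiled and run on its own. The class name UnicodeUnescape and the method name unescape are my own choices for illustration, not part of any library. I have also assumed that uppercase hex digits should be accepted, which the pattern in the question does not do.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class UnicodeUnescape {
    // A literal backslash, 'u', and exactly four hex digits.
    // Accepting A-F as well as a-f is an assumption beyond the original pattern.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Four hex digits always fit into a single UTF-16 code unit.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops a decoded '$' or '\' from being read
            // as a group reference or an escape in the replacement string.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        // Copy whatever follows the last match, or the whole input if nothing matched.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String testString = "\\u53ef\\u4ee5NonUnicode\\u544a\\u8bc9\\u6211";
        System.out.println(unescape(testString));
    }
}
```

Text between the escapes (such as NonUnicode in the test string) is copied through unchanged by appendReplacement() and appendTail(), so mixed input should be handled. An escaped surrogate pair is decoded as two consecutive code units, which together form the intended character.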

Tobias
  • It won't work if there are non-Unicode characters in the string. I'll revise the string in the question to emphasize that use case. – mike Sep 14 '15 at 10:37