I have a string as follows:

this is the string u00c5 with missing slash before unicode characters

It contains Unicode character codes, but the backslashes before each "u" are missing. How can I print this string correctly?

What have I tried?

I tried to add a backslash before each incomplete Unicode escape using the following code. However, "\u$1" is not allowed as the replacement string in replaceAll.

public String sanitizeUnicodeQuirk(String input) {
    try {
        // String processedInput = input.replaceAll("[uU]([0123456789abcdefABCDEF]{4})", String.valueOf(Integer.parseInt("$1", 16)));    // $1 is taken literally, which makes valueOf and parseInt useless
        String processedInput = input.replaceAll("[uU]([0123456789abcdefABCDEF]{4})", "\\\\u$1");    // Cannot make "\u$1"
        String newInput = new String(processedInput.getBytes(), "UTF-8");
        return newInput;
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

    return input;
}
Mehmed
  • 2
    `"u00c5".replaceAll("([uU][0123456789abcdefABCDEF]{4})", "\\\\$1")` gives you `\u00c5` what is the issue? You just have to remove the `u` in the second arg of the replaceAll method. – alain.janinm Jan 27 '17 at 12:30
  • 2
    The translation from unicode escape sequences to characters occurs at compile time. – Alderath Jan 27 '17 at 12:31
  • my mistake, it should be "[uU](...)". I also couldn't get it working other way. – Mehmed Jan 27 '17 at 12:32
  • 1
    Possible duplicate of [How to display currency symbol from utf-8 values?](http://stackoverflow.com/questions/41833786/how-to-display-currency-symbol-from-utf-8-values) – Alastair McCormack Jan 27 '17 at 12:35
  • @Alderath so there is no way to do it except grabbing four hex codes and converting them into character, am I right? – Mehmed Jan 27 '17 at 12:39
  • Possible duplicate of [How to convert a string with Unicode encoding to a string of letters](http://stackoverflow.com/questions/11145681/how-to-convert-a-string-with-unicode-encoding-to-a-string-of-letters) – alain.janinm Jan 27 '17 at 13:28
  • @Mehmed Probably not. At least it is not possible to rely on unicode escape sequences. Assume that your `input` argument is a string which was read from a file, and the string is exactly "\u00c5". That would still remain as the six character string "\u00c5", it would not become a one character string with the corresponding unicode character. The unicode escape sequences are only processed at compile time and only if the string was specified as a string literal in the source code. – Alderath Jan 27 '17 at 15:04
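
Alderath's point can be checked directly. The following is a minimal sketch, with a class name (`EscapeDemo`) and method names made up for illustration, that compares a literal containing a real Unicode escape with a string holding only the six characters backslash, u, 0, 0, c, 5, as it would if it had been read from a file:

```java
public class EscapeDemo {
    // The compiler translates the escape in this literal, so the string holds one char.
    static String compiledEscape() {
        return "\u00c5";
    }

    // Here the backslash itself is escaped, so the string holds six ordinary chars,
    // just like text that arrives from a file or the network at runtime.
    static String runtimeText() {
        return "\\u00c5";
    }

    public static void main(String[] args) {
        System.out.println(compiledEscape().length()); // 1
        System.out.println(runtimeText().length());    // 6
    }
}
```

Nothing in the runtime turns the second string into the first, which is why re-inserting the backslash with replaceAll cannot be enough on its own.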

1 Answer

Yikes. Proof of concept using the possible duplicate link provided by @AlastairMcCormack in the comments:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String input = "this is the string u0075u0031u0032u0033u0034 with missing slash before unicode characters";
        System.out.println("Original input: " + input);
        // Match a 'u' or 'U' followed by exactly four hex digits.
        Pattern pattern = Pattern.compile("[uU][0-9a-fA-F]{4}");
        Matcher matcher = pattern.matcher(input);
        StringBuilder builder = new StringBuilder();
        int lastIndex = 0;
        while (matcher.find()) {
            // Drop the leading 'u' and parse the four hex digits as a char value.
            String codePoint = matcher.group().substring(1);
            System.out.println("Found code point: " + codePoint);
            Character charSymbol = (char) Integer.parseInt(codePoint, 16);
            // Copy the text between the previous match and this one, then the decoded char.
            builder.append(input.substring(lastIndex, matcher.start()) + charSymbol);
            lastIndex = matcher.end();
        }
        // Copy whatever follows the last match.
        builder.append(input.substring(lastIndex));
        System.out.println("Modded input: " + builder.toString());
    }
}

Yields:

Original input: this is the string u0075u0031u0032u0033u0034 with missing slash before unicode characters
Found code point: 0075
Found code point: 0031
Found code point: 0032
Found code point: 0033
Found code point: 0034
Modded input: this is the string u1234 with missing slash before unicode characters

This makes sense: the code point is stored in the String as ordinary characters, so no amount of simple scrubbing with regexes will turn it into the character itself. Each match has to be parsed and converted. It's not pretty, so I'd be happy too if someone had another way.
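
One possible variation, offered as a sketch rather than a definitive answer, lets `Matcher.appendReplacement` and `Matcher.appendTail` do the bookkeeping that `lastIndex` does above. The class name `AppendReplacementDemo` and the method name `decodeBareEscapes` are hypothetical. Like the proof of concept, it assumes every `u` followed by four hex digits is meant as an escape, so an ordinary word that happens to match would be decoded as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppendReplacementDemo {
    // Compiled once so that repeated calls reuse the same pattern.
    private static final Pattern BARE_ESCAPE = Pattern.compile("[uU]([0-9a-fA-F]{4})");

    // Hypothetical helper: decodes every bare escape in a single pass over the input.
    static String decodeBareEscapes(String input) {
        Matcher matcher = BARE_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            char decoded = (char) Integer.parseInt(matcher.group(1), 16);
            // quoteReplacement stops a decoded dollar sign or backslash from
            // being interpreted as replacement syntax.
            matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        // Copy whatever follows the last match.
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeBareEscapes("this is the string u00c5 with missing slash"));
    }
}
```

`StringBuffer` is used because the `StringBuilder` overloads of these two methods only exist from Java 9 onward.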

Jon Sampson
  • I didn't want to use java.util.regex in an Android app. Would it add a lot of overhead if I apply this to the text of each ListView item? Is there a faster way? – Mehmed Jan 27 '17 at 13:48
  • I can't speak to Android specifically, but I assume it's really a matter of scale. The more of these replacements you run, the more time it will take, of course. If you have control over where those list view items are coming from, then you might simply want to move as much processing off the Android client as possible. Just a note, in case it was not obvious: your sanitizeUnicodeQuirk method's use of [String.replaceAll](https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replaceAll%28java.lang.String,%20java.lang.String%29) is hiding the creation of a Pattern and a Matcher. – Jon Sampson Jan 27 '17 at 14:00
  • 1
    @Mehmed The implementation of `String.replaceAll` uses `java.util.regex`. The implementation of that function is: `return Pattern.compile(regex).matcher(this).replaceAll(replacement);` – Alderath Jan 27 '17 at 15:09
  • 1
    Modifying the `input` string while you are still looping through its matches is not a good idea. You should call `matcher()` one time only and then use the results to build up a separate `string` variable. This also prevents the corner case of the original `input` string containing a string like `u0075u0031u0032u0033u0034` that happens to get decoded into `u1234` and then the next call to `matcher()` finds that and it gets decoded again into `ሴ` by accident. Avoid double-replace issues in your code, that is a good way to corrupt data. – Remy Lebeau Jan 28 '17 at 01:44
  • @RemyLebeau - u0075u0031u0032u0033u0034 is truly beautiful. Great note. For posterity I'll make an edit. Thanks! – Jon Sampson Jan 28 '17 at 02:21
  • 1
    @JonSampson: I would get rid of the `Character` variable (why not use `char` anyway?), get rid of the `substring()` (you can pass the indexes directly to `append()`), and break up the `append()` into two separate calls to avoid allocating a temporary `string`, eg: `builder.append(input, lastIndex, matcher.start()).append((char) Integer.parseInt(codePoint, 16));` – Remy Lebeau Jan 28 '17 at 02:48
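
The suggestions in these comments can be combined into a reusable method. This is only a sketch under stated assumptions: the class name `SinglePassDecoder` and the method name `decode` are invented here, and the `Pattern` is kept in a static field so that it is compiled once rather than on every call, which may matter when the method runs for each list item.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SinglePassDecoder {
    // Compiled once and shared; Pattern instances are safe to reuse.
    private static final Pattern BARE_ESCAPE = Pattern.compile("[uU][0-9a-fA-F]{4}");

    // Hypothetical helper: walks the matches of the ORIGINAL input exactly once,
    // so decoded text is never scanned (and decoded) a second time.
    static String decode(String input) {
        Matcher matcher = BARE_ESCAPE.matcher(input);
        StringBuilder builder = new StringBuilder(input.length());
        int lastIndex = 0;
        while (matcher.find()) {
            // Skip the leading 'u' and parse the four hex digits.
            String hex = input.substring(matcher.start() + 1, matcher.end());
            // append(CharSequence, start, end) avoids a temporary substring for the
            // unchanged text; the char is appended in a separate call.
            builder.append(input, lastIndex, matcher.start())
                   .append((char) Integer.parseInt(hex, 16));
            lastIndex = matcher.end();
        }
        // Copy whatever follows the last match.
        builder.append(input, lastIndex, input.length());
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("u0075u0031u0032u0033u0034"));
    }
}
```

Because the matcher only ever sees the original input, `u0075u0031u0032u0033u0034` decodes to the five characters `u1234` and stops there, which is the double-replace case described above.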