
How can I decode a string like "promo desc %u20AC" into "promo desc €"? I tried:

URLDecoder.decode("promo desc %u20AC", "UTF-16");

This doesn't work, because URLDecoder expects two hex digits after each %, and u2 is not valid hex. The string to decode is generated by JavaScript like this:

var string = escape("{€ć"); // "%7B%u20AC%u0107"

I didn't want to use URLDecoder anyway because, semantically, it's not a URL I'm trying to decode but a very long text. I also think that simply converting % to \ is naive, since the text may contain other % sequences. What I am after is this function:

unescape("%7B%u20AC%u0107")

which exists in JavaScript but, to my knowledge, not in Java. How can I achieve this in Java?

Thanks

marco_sap
  • Strip out the percent sign first? – Robert Harvey May 26 '21 at 16:02
  • Where is this broken data coming from? Can you fix that rather than having to work around it? – Jon Skeet May 26 '21 at 16:10
  • There is no broken data at all. The string comes from JavaScript's escape function, which turns € into %u20AC. Likewise it turns ä into %E4, and this needs to be converted back to ä. So basically JavaScript escape('€') = %u20AC, and Java should translate that back to €. I cannot replace % with \ because I would also replace any other % in the text. And I need a general solution that handles other sequences too, like %E4. Any idea? – marco_sap May 27 '21 at 07:36
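
To make the failure described in the question concrete, here is a minimal sketch of how URLDecoder reacts to the two kinds of sequences. The class name UrlDecoderLimit and the helper rejects are made up for illustration; only URLDecoder.decode itself is the real API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlDecoderLimit {
    /** Returns true if URLDecoder refuses to decode the input (hypothetical helper). */
    static boolean rejects(String input) {
        try {
            URLDecoder.decode(input, "UTF-8");
            return false;
        } catch (IllegalArgumentException e) {
            // Thrown when the two characters after '%' are not hex digits.
            return true;
        } catch (UnsupportedEncodingException e) {
            // Not expected: UTF-8 is always supported.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rejects("%7B"));    // prints "false"
        System.out.println(rejects("%u20AC")); // prints "true"
    }
}
```

So the two-digit form is accepted, while the %u form is rejected outright rather than decoded.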

1 Answer

I was curious, because I've not seen the %u escapes before, but it turns out unescaping them is fairly easy:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches %uXXXX (four hex digits) or %XX (two hex digits).
private static final Pattern JAVASCRIPT_ESCAPE_SEQUENCE = Pattern.compile("%(u[0-9a-fA-F]{4}|[0-9a-fA-F]{2})");

/**
 * Unescape a JavaScript-escaped string.
 * Undoes the effect of calling <a href="https://developer.mozilla.org/de/docs/Web/JavaScript/Reference/Global_Objects/escape">
 * the JavaScript escape method</a>.
 */
static String unescape(String input) {
    Matcher matcher = JAVASCRIPT_ESCAPE_SEQUENCE.matcher(input);
    StringBuilder sb = new StringBuilder(input.length());
    while (matcher.find()) {
        String escapeSequence = matcher.group(1);
        if (escapeSequence.startsWith("u")) {
            // Drop the 'u' marker so only the hex digits remain.
            escapeSequence = escapeSequence.substring(1);
        }
        char c = (char) Integer.parseInt(escapeSequence, 16);
        // quoteReplacement keeps a decoded '$' or '\' from being treated as a group reference.
        matcher.appendReplacement(sb, Matcher.quoteReplacement(Character.toString(c)));
    }
    matcher.appendTail(sb);
    return sb.toString();
}

Given this method, unescape("%7B%u20AC%u0107") produces the desired output {€ć.
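
For anyone who wants to try this on a runtime older than Java 9, where Matcher.appendReplacement only accepts a StringBuffer, here is a self-contained sketch of the same approach. The class name JsUnescape is just a placeholder, and the Matcher.quoteReplacement call is a precaution of this sketch: without it, a decoded $ (from %24) or \ (from %5C) would be interpreted as part of a replacement pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsUnescape {
    // %uXXXX (four hex digits) or %XX (two hex digits), as produced by JavaScript's escape().
    private static final Pattern ESCAPE =
            Pattern.compile("%(u[0-9a-fA-F]{4}|[0-9a-fA-F]{2})");

    static String unescape(String input) {
        Matcher matcher = ESCAPE.matcher(input);
        // StringBuffer instead of StringBuilder so this also compiles on Java 8.
        StringBuffer sb = new StringBuffer(input.length());
        while (matcher.find()) {
            String hex = matcher.group(1);
            if (hex.startsWith("u")) {
                hex = hex.substring(1);
            }
            char c = (char) Integer.parseInt(hex, 16);
            // Quote the replacement so '$' and '\' are inserted literally.
            matcher.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String out = unescape("%7B%u20AC%u0107");
        // Print code units rather than characters, so the console encoding cannot hide the result.
        for (int i = 0; i < out.length(); i++) {
            System.out.println((int) out.charAt(i)); // prints 123, 8364, 263
        }
        System.out.println(unescape("100%25%20%3D%20%245")); // prints "100% = $5"
    }
}
```

Each %uXXXX is decoded as a single UTF-16 code unit, so a character outside the BMP, which escape() emits as two consecutive %u sequences, comes back as the matching surrogate pair.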

Joachim Sauer
  • Thank you very much, but it does not work for me. It generates: {€? Also, I had to use StringBuffer instead of StringBuilder. How do I get the desired output "{€ć"? – marco_sap May 31 '21 at 08:02
  • StringBuffer is the older, synchronized version, which isn't necessary here. Both StringBuffer and StringBuilder work with Pattern/Matcher (StringBuilder only since Java 9; if you are stuck in the ancient before-land then you'll need to use StringBuffer, yes). And yes: it does work, I have verified it. If you get `{€?` then there's a problem at some later point, where the encoding used can't represent the last character. Print `(int) output.charAt(2)` and you'll see that it's 263 for `ć` and not 63 (which would be `?`). – Joachim Sauer May 31 '21 at 09:01
  • I tried to run it on https://www.tutorialspoint.com/compile_java_online.php and it works. Where could the problem be in my setup? I mean, why is my Java environment not able to represent the character ć? Besides, I've used this routine in a servlet in the cloud and I'm getting the same issue there. Should I be setting some specific encoding? – marco_sap May 31 '21 at 13:26
  • That can have so many different reasons. There's no point analyzing this in a comment. Ask it as a new question, maybe. If you do, make sure to make it as self-contained as possible (i.e. skip the unescaping part, for example, and use just `"\u0107"` as the string to output). – Joachim Sauer May 31 '21 at 13:30
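
The check suggested in the comments can be sketched as below. It is only one possible explanation, not a diagnosis of the asker's setup: it assumes that somewhere on the output path the text is encoded with windows-1252, which contains € but not ć. The class name EncodingCheck and the helper roundTrip are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    /** Encodes and decodes with the same charset; unmappable characters become '?'. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String output = "{\u20AC\u0107"; // what unescape("%7B%u20AC%u0107") returns

        // The string itself is intact: the third code unit is 263, not 63 ('?').
        System.out.println((int) output.charAt(2)); // prints 263

        // Assumption: windows-1252 is the charset in play somewhere downstream.
        Charset cp1252 = Charset.forName("windows-1252");
        String lossy = roundTrip(output, cp1252);
        System.out.println((int) lossy.charAt(1)); // prints 8364, the euro sign survives
        System.out.println((int) lossy.charAt(2)); // prints 63, the last character became '?'

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(output, StandardCharsets.UTF_8).equals(output)); // prints "true"
    }
}
```

If the first number printed is 263 in the failing environment too, the unescaping is fine and the loss happens later, for example in the console, the servlet response encoding, or a file writer.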