Java %u20AC conversion to euro €

Question

How can I decode the string passed in a call like this:

URLDecoder.decode("promo desc %u20AC", "UTF-16");

into "promo desc €"? The call above doesn't work, because URLDecoder expects two hex digits after % and "u2" is not valid hex. The string to decode is generated by JavaScript like this:

var string = escape("{€ć") ---> "%7B%u20AC%u0107"

I didn't want to use URLDecoder because, semantically, it's not a URL I'm trying to decode but a very long text. In Java, % introduces a two-digit hex escape, so %u is illegal. I think that converting % to \ is a bit naive, because there may be other % sequences in the text. What I am after is this function here:

unescape("%7B%u20AC%u0107")

which exists in JavaScript but, to my knowledge, not in Java. How can I achieve this in Java?

Thanks

Where is this broken data coming from? Can you fix that rather than having to work around it? — Jon Skeet, May 26 '21 at 16:10
There is no broken data at all, the code comes from the function escape in javascript which turns € into %u20AC. Likewise it turns ä into %E4 and this needs to be converted back to ä. So basically Javascript escape('€')=%u20AC --> java should translate back to €. I cannot replace % by \ because I would replace possible % as well. And I must find a general solution for other symbols too like %E4 etc. Any idea? — marco_sap, May 27 '21 at 07:36

score 1 · Answer 1 · answered May 29 '21 at 11:21

I was curious, because I've not seen the %u escapes before, but it turns out unescaping them is fairly easy:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches %uXXXX (four hex digits) or %XX (two hex digits).
private static final Pattern JAVASCRIPT_ESCAPE_SEQUENCE = Pattern.compile("%(u[0-9a-fA-F]{4}|[0-9a-fA-F]{2})");

/**
 * Unescape a JavaScript-escaped string.
 * Undoes the effect of calling
 * <a href="https://developer.mozilla.org/de/docs/Web/JavaScript/Reference/Global_Objects/escape">the JavaScript escape method</a>.
 */
static String unescape(String input) {
    Matcher matcher = JAVASCRIPT_ESCAPE_SEQUENCE.matcher(input);
    // On Java 8 and earlier, use StringBuffer here instead.
    StringBuilder sb = new StringBuilder(input.length());
    while (matcher.find()) {
        String escapeSequence = matcher.group(1);
        if (escapeSequence.startsWith("u")) {
            escapeSequence = escapeSequence.substring(1);
        }
        char c = (char) Integer.parseInt(escapeSequence, 16);
        // quoteReplacement keeps a decoded '$' or '\' (from %24 or %5C) literal;
        // unquoted, appendReplacement would treat them as group references or escapes.
        matcher.appendReplacement(sb, Matcher.quoteReplacement(Character.toString(c)));
    }
    matcher.appendTail(sb);
    return sb.toString();
}

Given this method, unescape("%7B%u20AC%u0107") produces the desired output {€ć.
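For readers on Java 9 or later, the same idea can be written more compactly with Matcher.replaceAll and a replacement function. This is only a sketch of a variant, not the answer's original code: the class name EscapeSketch is made up for illustration, and it assumes the input contains only the two escape forms that escape() produces (%XX and %uXXXX). The replacement is passed through Matcher.quoteReplacement so that a decoded `$` or `\` is inserted literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this example.
final class EscapeSketch {
    // Same pattern as in the answer: %uXXXX or %XX.
    private static final Pattern ESCAPE =
            Pattern.compile("%(u[0-9a-fA-F]{4}|[0-9a-fA-F]{2})");

    static String unescape(String input) {
        // replaceAll(Function) is available since Java 9.
        return ESCAPE.matcher(input).replaceAll(match -> {
            String hex = match.group(1);
            if (hex.startsWith("u")) {
                hex = hex.substring(1); // drop the 'u' marker
            }
            char c = (char) Integer.parseInt(hex, 16);
            // The function's result is still parsed as a replacement string,
            // so '$' and '\' must be quoted.
            return Matcher.quoteReplacement(String.valueOf(c));
        });
    }

    public static void main(String[] args) {
        String decoded = unescape("%7B%u20AC%u0107");
        // Print code points rather than characters, so the result does not
        // depend on the console encoding.
        for (char c : decoded.toCharArray()) {
            System.out.println((int) c); // 123, 8364, 263
        }
    }
}
```

A % that is not followed by a valid escape, as in "100%", does not match the pattern and is left unchanged.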

answered May 29 '21 at 11:21

Joachim Sauer


Thank you very much, but it does not work for me. It generates {€? instead. Also, I had to use StringBuffer instead of StringBuilder. How do I get the desired output "{€ć"? – marco_sap May 31 '21 at 08:02
StringBuffer is the older, synchronized version, and that synchronization isn't necessary here. Matcher accepts a StringBuilder since Java 9; on older versions you do need StringBuffer. And yes, it does work, I have verified it. If you get `{€?`, then there's a problem at some later point where the encoding in use can't represent the last character. Print `(int) output.charAt(2)` and you'll see that it's 263 for `ć` and not 63 (which would be `?`). – Joachim Sauer May 31 '21 at 09:01
I tried to run it on https://www.tutorialspoint.com/compile_java_online.php and it works. Where could the problem be in my setup? Why is my Java environment unable to represent the character ć? I've also used this routine in a servlet in the cloud and I get the same issue there. Should I be setting a specific encoding? – marco_sap May 31 '21 at 13:26
That can have many different causes, and there's no point analyzing them in a comment. Ask it as a new question, maybe. If you do, make it as self-contained as possible (for example, skip the unescaping part and use just `"\u0107"` as the string to output). – Joachim Sauer May 31 '21 at 13:30
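The check suggested in the comments can be tried in isolation. The sketch below uses made-up names (EncodingCheck, roundTrip) and assumes the `?` comes from an encoding step that cannot represent ć, which is one plausible cause rather than a confirmed one. It first confirms that the character really is U+0107 (263), then shows that encoding it with a charset lacking that character substitutes `?`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic class; names are assumptions for this example.
final class EncodingCheck {
    // Encode the text with the given charset and decode it again.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "{\u20AC\u0107"; // the decoded string from the question

        // The in-memory string is fine: prints 263, not 63.
        System.out.println((int) text.charAt(2));

        // UTF-8 can represent every character, so the round trip is lossless.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));

        // ISO-8859-1 has no mapping for U+0107, so it is replaced by '?' (63).
        System.out.println((int) roundTrip("\u0107", StandardCharsets.ISO_8859_1).charAt(0));

        // The platform default charset is a common suspect for console output.
        System.out.println(Charset.defaultCharset());
    }
}
```

If the default charset printed on the affected machine is not UTF-8, that is a likely place to look. In a servlet, the response character encoding (for example via response.setCharacterEncoding("UTF-8")) is another candidate, though this thread does not confirm which one applied.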
