3

EDIT:

I am reading that string from a file, so this topic is actually about the following question:

I have this string, which is equal (per equals()) to the one read from the file:

"Diogo Pi\\u00e7arra - Tu E Eu"

How can I make Java read the resulting string "\u00e7" as a "ç" character?

This happens because the file does not store the character itself (as UTF-8 would) but an escaped Unicode sequence, which is why I read "\u00e7" as six literal characters and not as a single Unicode character. So I need a function that parses this at runtime. I could chain .replace() calls to do it, but that feels like reinventing the wheel.
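
To make the distinction concrete, here is a minimal sketch of the two strings involved. The class and field names are made up for illustration; only the string contents come from the question.

```java
public class EscapeVsChar {
    // The compiler translates this escape at compile time, so the string
    // holds the single character U+00E7 (c with cedilla).
    public static final String COMPILED = "Pi\u00e7arra";

    // The doubled backslash yields a literal backslash followed by 'u', '0', '0', 'e', '7'.
    // This is what arrives when the escape is read as plain text from a file.
    public static final String FROM_FILE = "Pi\\u00e7arra";

    public static void main(String[] args) {
        System.out.println(COMPILED.length());          // 7
        System.out.println(FROM_FILE.length());         // 12
        System.out.println(COMPILED.equals(FROM_FILE)); // false
    }
}
```

No charset conversion can bridge that gap, because the second string really contains six ordinary ASCII characters where the first contains one.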


Old Question (asked in the wrong way before I understood what was going on; please ignore the following text):

I have the following string:

final String str = "Diogo Pi\u00e7arra - Tu E Eu";

and I want to convert it to:

"Diogo Piçarra - Tu E Eu"

I have tried everything, from the unescape function in Apache Commons Lang to

new String(str.getBytes("UTF-16"), "UTF-16")

or

new String(str.getBytes("UTF-8"), "UTF-8")

or

new String(str.getBytes("UTF-16"))

or

new String(str.getBytes("UTF-8"))

But nothing works!

What can I try next?

Thanks!

PedroD

2 Answers

4

This is how I got it working when reading from a file that contains explicitly written Unicode escapes:

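    // Needs java.io.*, java.nio.charset.StandardCharsets and Apache Commons Lang's StringEscapeUtils.
    StringBuilder output = new StringBuilder();
    try (BufferedReader reader1 = new BufferedReader(
            new InputStreamReader(file.getInputStream(), StandardCharsets.UTF_8))) {
        int c; // read() returns an int; -1 means end of stream, so do not cast to byte
        while ((c = reader1.read()) != -1) {
            output.append((char) c);
        }
    }
    // Turn the literal backslash-u sequences into the characters they name.
    return StringEscapeUtils.unescapeJava(output.toString());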

This works because

StringEscapeUtils.unescapeJava("Diogo Pi\\u00e7arra - Tu E Eu")
results in "Diogo Piçarra - Tu E Eu"
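
If pulling in Apache Commons is not an option, the same idea can be sketched with the JDK alone. This is a minimal illustration, not a replacement for unescapeJava: the class name `UnicodeUnescaper` and method `unescape` are hypothetical, it only handles backslash-u escapes with four hex digits, and it does not treat a doubled backslash in the input as an escaped backslash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescaper {
    // A literal backslash, one or more 'u' characters, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    /** Replaces each escape of that form with the UTF-16 code unit it names. */
    public static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        // StringBuffer keeps this compatible with Java 8's appendReplacement.
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops a decoded '$' or backslash from being
            // interpreted as a group reference in the replacement text.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String fromFile = "Diogo Pi\\u00e7arra - Tu E Eu";
        String expected = "Diogo Pi\u00e7arra - Tu E Eu";
        System.out.println(unescape(fromFile).equals(expected)); // true
    }
}
```

Because each escape is decoded to one UTF-16 code unit, a surrogate pair written as two consecutive escapes comes out as the correct pair in the resulting string.
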
PedroD
  • 1
    Please mark this as the accepted answer because the currently marked answer doesn't answer the question. Also your question doesn't seem to want UTF-8 because a Java string is UTF-16. – Tom Blodget Aug 05 '15 at 01:27
-1
final String str = new String("Diogo Pi\u00e7arra - Tu E Eu".getBytes(), 
                              Charset.forName("UTF-8"));

Try using the getBytes() method without parameters (the default charset will be used). But that is not necessary; no conversion is required:

final String str = "Diogo Pi\u00e7arra - Tu E Eu";

You'll get the same result.

Andrew Tobilko
  • It is not working on my machine; however, it works on IDEONE, even without that conversion: http://ideone.com/B3dwD9 – PedroD Aug 04 '15 at 21:19
  • I believe I am receiving an escaped string, the equivalent to my situation would be something like this I guess "Diogo Pi\\u00e7arra - Tu E Eu" (notice the double \). I am reading that string from a file, the file does not contain \\ but the \uXXXX is not interpreted as a special unicode character. – PedroD Aug 04 '15 at 21:22
  • @PedroD, look at [this](http://stackoverflow.com/questions/25548646/wrong-file-encoding-in-jvm-after-linux-update). You have an encoding error in JVM. – Andrew Tobilko Aug 04 '15 at 21:27
  • @PedroD, after that, let me know the results – Andrew Tobilko Aug 04 '15 at 21:29
  • Ok, now it returns utf-8 instead of ANSI_X3.4-1968 after adding the flag -Dfile.encoding=utf-8 to my java command, but the error persists. The problem is that I am reading an escaped string from a file (which I cannot change). What I am reading is the equivalent of "Diogo Pi\\u00e7arra - Tu E Eu". What I need to do is force Java to read the string "\u00e7" as a special Unicode character. I could do it with a .replace(), but that would be reinventing the wheel, I guess... – PedroD Aug 04 '15 at 21:36
  • This answer's example has gone offline, but they seem to tackle this exact issue: http://stackoverflow.com/questions/3537706/howto-unescape-a-java-string-literal-in-java – PedroD Aug 04 '15 at 21:43
  • In the end I need to implement this in Java lol: http://www.rapidmonkey.com/unicodeconverter/reverse.jsp – PedroD Aug 04 '15 at 21:44
  • I found a way to do it with StringEscapeUtils.unescapeJava(). However, I **MUST** read from the file char by char (as an int, using the read() method) and append each one to a StringBuilder the following way: *sb.append(new String(new byte[] { c }, "UTF-8"))*. Then I perform *StringEscapeUtils.unescapeJava(sb.toString());*. – PedroD Aug 04 '15 at 22:11
  • This answer is wrong for multiple reasons: 1. in Java source, `"\u00e7"` and `"ç"` are identical in every respect; 2. `getBytes()` is locale-dependent and might irreversibly mangle data; 3. even if the locale uses UTF-8, `new String(x.getBytes(), Charset.forName("UTF-8"))` will always be equal to `x`; 4. you should use `StandardCharsets.UTF_8` anyway – Karol S Aug 06 '15 at 16:55
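
The points in the last comment can be checked with a short sketch. The class `RoundTripDemo` and its helper `roundTrip` are hypothetical names; US-ASCII stands in for a default charset that cannot represent "ç" (the ANSI_X3.4-1968 mentioned earlier in the thread is an alias for it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    /** Encodes the string to bytes and decodes it again with the same charset. */
    public static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String x = "Pi\u00e7arra";

        // Point 1: the escape and the character are the same thing after compilation.
        System.out.println(x.charAt(2) == (char) 0xE7); // true

        // Point 3: a UTF-8 round trip hands back an equal string, so it changes nothing.
        System.out.println(roundTrip(x, StandardCharsets.UTF_8).equals(x)); // true

        // Point 2: a charset that lacks the character replaces it with '?', and the loss is permanent.
        System.out.println(roundTrip(x, StandardCharsets.US_ASCII)); // Pi?arra
    }
}
```

In other words, re-encoding can only leave the string unchanged or damage it; it never interprets escape sequences.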