
I was able to figure out how to convert a Unicode string to an ASCII string using the following code. (Credits are in the code.)

    //create a string using unicode that says "hello" when printed to console
    String unicode = "\u0068" + "\u0065" + "\u006c" + "\u006c" + "\u006f";
    System.out.println(unicode);
    System.out.println("");

    /* Test code for converting unicode to ASCII
     * Taken from http://stackoverflow.com/questions/15356716/how-can-i-convert-unicode-string-to-ascii-in-java
     * Will be commented out later after tested and implemented.
     */
    //String s = "口水雞 hello Ä";

    //replace String s with String unicode for conversion
    String s1 = Normalizer.normalize(unicode, Normalizer.Form.NFKD);
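    //no Pattern.quote here: quoting would make replaceAll look for this text literally instead of using it as a regex
    String regex = "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+";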

    String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");

    System.out.println(s2);
    System.out.println(unicode.length() == s2.length());
    //End of Test code that was implemented

Now my curiosity has gotten the better of me. I've tried googling, since my knowledge of Java isn't the best.

My question is: is it possible to convert an ASCII string to a UTF format, especially UTF-16? (I say UTF-16 because I know how similar UTF-8 is to ASCII, so converting from ASCII to UTF-8 would not be necessary.)

Thanks in advance!

Rion Murph Murphy

1 Answer

Java strings use UTF-16 as their internal format, but that is rarely relevant, because the String class takes care of it. You will see the difference only in two cases:

  1. when examining the String as an array of bytes (see below). This is what happens in C all the time, but it's not the case in more modern languages that properly distinguish between a string and an array of bytes (e.g. Java or Python 3.x).
  2. when converting to a more restrictive encoding (which is what you did, Unicode to ASCII), as some characters will need to be replaced.
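
To make the two cases concrete, here is a minimal sketch. The class name EncodingDemo and the hex helper are made up for illustration; only String.getBytes and StandardCharsets come from the JDK. It prints the bytes that one and the same String produces under three charsets, and shows what happens to a character that ASCII cannot represent.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: render a byte array as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "h\u00e9"; // 'h' followed by e-acute

        // Case 1: the String is the same, the bytes depend on the charset.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // 68 c3 a9
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 68 00 e9

        // Case 2: ASCII has no e-acute, so getBytes substitutes '?' (0x3f).
        System.out.println(hex(s.getBytes(StandardCharsets.US_ASCII))); // 68 3f
    }
}
```

For a pure ASCII string such as "hello", the ASCII and UTF-8 lines would be identical, which is why only UTF-16 looks like a real "conversion".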

If you want to encode the content to UTF-16 before writing to a file (or equivalent), you can do it with:

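    String data = "TEST";
    //try-with-resources closes the stream even if write() throws
    try (OutputStream output = new FileOutputStream("filename.txt")) {
        //the "UTF-16" charset encodes as big-endian and prepends a BOM
        output.write(data.getBytes("UTF-16"));
    }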

And the resulting file will contain:

    0000000: feff 0054 0045 0053 0054                 ...T.E.S.T

That is big-endian UTF-16, with the BOM bytes (FE FF) at the beginning.
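
If the BOM is not wanted, or the bytes need to go back into a String, something along these lines should work. This is only a sketch: the class name Utf16RoundTrip and its two helper methods are hypothetical, while the StandardCharsets constants are the standard JDK ones.

```java
import java.nio.charset.StandardCharsets;

public class Utf16RoundTrip {
    // Hypothetical helper: encode as big-endian UTF-16 with a leading BOM.
    static byte[] encodeWithBom(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // Hypothetical helper: decode UTF-16 bytes; a leading BOM, if present,
    // selects the byte order and is not part of the resulting String.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String data = "TEST";

        byte[] withBom = encodeWithBom(data);                      // BOM + 4 chars
        byte[] bigEndian = data.getBytes(StandardCharsets.UTF_16BE);    // no BOM
        byte[] littleEndian = data.getBytes(StandardCharsets.UTF_16LE); // no BOM

        System.out.println(withBom.length);      // 10
        System.out.println(bigEndian.length);    // 8
        System.out.println(littleEndian.length); // 8

        // Round trip: the decoded String equals the original.
        System.out.println(decode(withBom).equals(data)); // true
    }
}
```

Choosing UTF_16BE or UTF_16LE explicitly is mostly useful when the reader of the file expects a fixed byte order and would treat a BOM as data.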

Stefano Sanfilippo