Convert "php unicode" to character

Question

How can I convert so called "php unicode"(link to php unicode) to normal character via Java? Example \xEF\xBC\xA1 -> A. Are there any embedded methods in jdk or should I use regex for this conversion?

Is your input in string format (`\xNN`) or binary format? – casablanca Nov 23 '10 at 23:04 — casablanca, Nov 23 '10 at 23:04

score 2 · Answer 1 · edited May 23 '17 at 10:26

You first need to get the bytes out of the string into a byte-array without changing them and then decode the byte-array as a UTF-8 string.

The simplest way to get the string into a byte array is to encode it using ISO-8859-1 which map every character with a unicode value less than 256 to a byte with the same value (or the equivalent negative)

String phpUnicode = "\u00EF\u00BC\u00A1"
byte[] bytes = phpUnicode.getBytes("ISO-8859-1"); // maps to bytes with the same ordinal value
String javaString = new String(bytes, "UTF-8");
System.out.println(javaString);

Edit
The above converts the UTF-8 to the Unicode character. If you then want to convert it to a reasonable ASCII equivalent, there's no standard way of doing that: but see this question

Edit
I assumed that you had a string containing characters that had the same ordinal value as the UTF-8 sequence but you indicate that your string literally contains the escape sequence, as in:

String phpUnicode = "\\xEF\\xBC\\xA1";

The JDK doesn't have any built-in methods to convert Strings like this so you'll need to use your own regex. Since we ultimately want to convert a utf-8 byte-sequence into a String, we need to set up a byte-array, using maybe:

Pattern oneChar = Pattern.compile("\\\\x([0-9A-F]{2})|(.)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher matcher = oneChar.matcher(phpUnicode);
ByteArrayOutputStream bytes = new ByteArrayOutputStream();

while (matcher.find()) {
    int ch;
    if (matcher.group(1) == null) {
        ch = matcher.group(2).charAt(0);
    }
    else {
        ch = Integer.parseInt(matcher.group(1), 16);
    }
    bytes.write((int) ch);
}
String javaString = new String(bytes.toByteArray(), "UTF-8");
System.out.println(javaString);

This will generate a UTF-8 stream by converting \xAB sequences. This UTF-8 stream is then converted to a Java string. It's important to note that any character that is not part of an escape sequence will be converted to a byte equivalent to to the low-order 8 bites of the unicode character. This works fine for ascii but can cause transcoding problems for non-ascii characters.

@McDowell:
The sequence:

String phpUnicode = "\u00EF\u00BC\u00A1"
byte[] bytes = phpUnicode.getBytes("ISO-8859-1");

creates a byte array containing as many bytes as the original string has characters and for each character with a unicode value below 256, the same numeric value is stored in the byte-array.

The character FULLWIDTH LATIN CAPITAL LETTER A (U+FF41) is not present in the original String so the fact that it is not in ISO-8859-1 is irrelevant.

I know that transcoding bugs can occur when you convert characters to bytes that's why I said that ISO-8859-1 would only "map every character with a unicode value less than 256 to a byte with the same value"

Nice but than I need to convert \xNN\xNN string to unicode string, I have written a regexp that catches NN characters but how can I create an unicode string from NN ? F.e. i have NN i need "\u0NN"(string addition doesnt work here) — Le_Coeur, Nov 24 '10 at 10:31
Java strings are UTF-16; trying to represent UTF-8 in them (`"\u00EF\u00BC\u00A1"`) will just lead to transcoding bugs. In any case, the character FULLWIDTH LATIN CAPITAL LETTER A is not present in ISO-8859-1. — McDowell, Nov 24 '10 at 12:02

score 1 · Accepted Answer · answered Nov 24 '10 at 11:59

The character in question is U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A). The PHP form (\xEF\xBC\xA1) is a UTF-8 encoded octet sequence.

In order to decode this sequence to a Java String (which is always UTF-16), you would use the following code:

// \xEF\xBC\xA1
byte[] utf8 = { (byte) 0xEF, (byte) 0xBC, (byte) 0xA1 };
String utf16 = new String(utf8, Charset.forName("UTF-8"));

// print the char as hex   
for(char ch : utf16.toCharArray()) {
    System.out.format("%02x%n", (int) ch);
}

If you want to decode the data from a string literal you could use code of this form:

public static void main(String[] args) {
  String utf16 = transformString("This is \\xEF\\xBC\\xA1 string");
  for (char ch : utf16.toCharArray()) {
    System.out.format("%s %02x%n", ch, (int) ch);
  }
}

private static final Pattern SEQ 
                           = Pattern.compile("(\\\\x\\p{Alnum}\\p{Alnum})+");

private static String transformString(String encoded) {
  StringBuilder decoded = new StringBuilder();
  Matcher matcher = SEQ.matcher(encoded);
  int last = 0;
  while (matcher.find()) {
    decoded.append(encoded.substring(last, matcher.start()));
    byte[] utf8 = toByteArray(encoded.substring(matcher.start(), matcher.end()));
    decoded.append(new String(utf8, Charset.forName("UTF-8")));
    last = matcher.end();
  }
  return decoded.append(encoded.substring(last, encoded.length())).toString();
}

private static byte[] toByteArray(String hexSequence) {
  byte[] utf8 = new byte[hexSequence.length() / 4];
  for (int i = 0; i < utf8.length; i++) {
    int offset = i * 4;
    String hex = hexSequence.substring(offset + 2, offset + 4);
    utf8[i] = (byte) Integer.parseInt(hex, 16);
  }
  return utf8;
}

Convert "php unicode" to character

2 Answers2

Linked