
I'm trying to write a Java equivalent to PHP's ord():

public static int ord(char c) {
    return (int) c;
}

public static int ord(String s) {
    return s.length() > 0 ? ord(s.charAt(0)) : 0;
}

This seems to work well for characters with an ordinal value of up to 127, i.e. within ASCII. However, PHP returns 195 (and higher) for characters from the extended ASCII table or beyond. A comment by Mr. Llama on an answer to a related question explains this as follows:

To elaborate, the reason é showed ASCII 195 is because it's actually a two-byte character (UTF-8), the first byte of which is ASCII 195. – Mr. Llama

I hence changed my ord(char c) method to mask out all but the most significant byte:

public static int ord(char c) {
    return (int) (c & 0xFF);
}

Still, the results differ. Two examples:

  • ord('é') (U+00E9) gives 195 in PHP while my Java function yields 233
  • ord('⸆') (U+2E06) gives 226 in PHP while my Java function yields 6
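The discrepancy is easy to observe directly. Here is a small sketch for the é case (the class name is just for illustration):

```java
import java.nio.charset.StandardCharsets;

public class OrdDemo {
    public static void main(String[] args) {
        char e = '\u00E9'; // 'é'
        System.out.println((int) e);        // 233: the UTF-16 code unit
        System.out.println(e & 0xFF);       // 233: masking changes nothing below 256
        byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8); // {0xC3, 0xA9}
        System.out.println(utf8[0] & 0xFF); // 195: the first UTF-8 byte, what PHP reports
    }
}
```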

I managed to get the same behavior for the method that accepts a String by first turning the String into a byte array, explicitly using UTF-8 encoding:

public static int ord(String s) {
    return s.length() > 0 ? ord((char)s.getBytes(StandardCharsets.UTF_8)[0]) : 0;
}

However, the method that accepts a char still behaves as before, and I have not yet found a solution for that. In addition, I don't understand why the change actually worked: Charset.defaultCharset() returns UTF-8 on my platform anyway. So...

  • How can I make my function behave similarly to PHP's?
  • Why does the change to ord(String s) actually work?

Explanatory answers are much appreciated, as I want to grasp what's going on exactly.

    Java appears to be correct; 233 is indeed the code for `é`: http://www.ascii-code.com/. 195 is the code for `Ã`, so who knows WTF is going on under-the-hood in PHP. – Oliver Charlesworth Apr 18 '17 at 21:27
  • Actually, seems to be heavily related to this: http://stackoverflow.com/questions/35575721/ord-doesnt-work-with-utf-8 – Oliver Charlesworth Apr 18 '17 at 21:29
  • @OliverCharlesworth that is correct, PHP's `ord()` does not work correctly with characters outside the ASCII range. However, I'm trying to replicate that behavior. – domsson Apr 18 '17 at 21:33

1 Answer


In Java, a char is a UTF-16 code unit. Converting UTF-16 to UTF-8 is not just & 0xFF: for instance, U+01FF is encoded as C7 BF in UTF-8, so PHP's ord() would give 0xC7 (199), yet 0x01FF & 0xFF is 0xFF (255).
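The U+01FF case can be checked against getBytes with a quick sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16VsUtf8 {
    public static void main(String[] args) {
        String s = "\u01FF";                      // a single UTF-16 code unit
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(b));   // [-57, -65], i.e. 0xC7 0xBF
        System.out.println(b[0] & 0xFF);          // 199 (0xC7): what PHP's ord() sees
        System.out.println(0x01FF & 0xFF);        // 255: what the mask produces instead
    }
}
```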

The String version works because it is actually transforming to UTF-8.

The simplest way is to reverse your two overloads, since String has a convenient method to get UTF-8:

public static int ord(String s) {
    return s.length() > 0 ? (s.getBytes(StandardCharsets.UTF_8)[0] & 0xff) : 0;
}

and convert the char to a String:

public static int ord(char c) {
    return c < 0x80 ? c : ord(Character.toString(c));
}
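Put together, the two overloads reproduce PHP's results for the question's examples. A self-contained sketch (the class name is arbitrary):

```java
import java.nio.charset.StandardCharsets;

public class OrdTest {
    public static int ord(String s) {
        return s.length() > 0 ? (s.getBytes(StandardCharsets.UTF_8)[0] & 0xff) : 0;
    }

    public static int ord(char c) {
        return c < 0x80 ? c : ord(Character.toString(c));
    }

    public static void main(String[] args) {
        System.out.println(ord('A'));      // 65: ASCII, unchanged
        System.out.println(ord('\u00E9')); // 195: matches PHP's ord('é')
        System.out.println(ord("\u2E06")); // 226: matches PHP's ord('⸆')
    }
}
```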

While this works, it is not quite efficient because of the unnecessary char → String → int conversion. The first byte of the UTF-8 encoding of a Unicode code point c can actually be computed directly:

if (c < 0x80) {
    return c;               // ASCII: encoded as itself
} else if (c < 0x800) {
    return 0xc0 | c >> 6;   // 2-byte sequence
} else if (c < 0x10000) {
    return 0xe0 | c >> 12;  // 3-byte sequence
} else {
    return 0xf0 | c >> 18;  // 4-byte sequence
}
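As a sanity check, that branch logic can be compared against getBytes for a few code points. In this sketch, firstUtf8Byte is just a hypothetical name wrapping the if/else chain:

```java
import java.nio.charset.StandardCharsets;

public class FirstByteCheck {
    static int firstUtf8Byte(int c) {
        if (c < 0x80)           return c;
        else if (c < 0x800)     return 0xc0 | c >> 6;
        else if (c < 0x10000)   return 0xe0 | c >> 12;
        else                    return 0xf0 | c >> 18;
    }

    public static void main(String[] args) {
        // Three BMP characters plus an astral one (U+1F600, a surrogate pair in UTF-16)
        for (int cp : new int[]{0x41, 0xE9, 0x2E06, 0x1F600}) {
            int reference = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8)[0] & 0xFF;
            System.out.println(firstUtf8Byte(cp) == reference); // true for all four
        }
    }
}
```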

You may also want to read What is Unicode, UTF-8, UTF-16? for some background information.

  • Ah, that explains everything! Terrific answer, thank you. I'm using Java for about 3 years now and never knew it used UTF-16 internally. I'll make sure to read through the linked references carefully. – domsson Apr 19 '17 at 07:26