-1

I am trying to translate a PHP encoding function into an Android Java method. Because Java's string length function handles UTF-8 strings differently from PHP's, I failed to make the translated Java code produce the same result as the PHP code for the second string, str2, which contains multi-byte UTF-8 characters. The first, ASCII-only string str1 does work.

The original PHP code is:

    function myhash_php($string,$key) {
        $strLen = strlen($string);
        $keyLen = strlen($key);
        $j=0 ; $hash = "" ;
        for ($i = 0; $i < $strLen; $i++) {
            $ordStr = ord(substr($string,$i,1));
            if ($j == $keyLen) { $j = 0; }
            $ordKey = ord(substr($key,$j,1));
            $j++;
            $hash .= strrev(base_convert(dechex($ordStr + $ordKey),16,36));
        }
        return $hash;
    }
    $str1 = "good friend" ;
    $str2 = "好友" ;    //  strlen($str2) == 6
    $key  = "iuyhjf476" ;
    echo "php encode str1 '". $str1 ."'=".myhash_php($str1, $key)."<br>";
    echo "php encode str2 '". $str2 ."'=".myhash_php($str2, $key)."<br>";

The PHP output is:

    php encode str1 'good friend'=s5c6g6o5u3o5m4g4b4z516
    php encode str2 '好友'=a9u7m899x6p6

The current translated Java code, which produces the wrong result, is:

    public static String   hash_java(String  string, String  key) {
        //Integer strLen  = byteLenUTF8(string) ; // consistent with php strlen("好友")==6
        //Integer keyLen  = byteLenUTF8(key) ;    //   byteLenUTF8("好友") == 6
        Integer strLen  = string.length() ;      //     "好友".length()  ==  2
        Integer keyLen  = key.length() ;
        int j=0 ;
        String  hash = "" ;
        int ordStr, ordKey ;
        for (int i = 0; i < strLen; i++) {
            ordStr = ord_java(string.substring(i,i+1));  //string is String,  php  substr($string,$i,$n)  ==  java string.substring(i, i+n)
            // ordStr = ord_java(string[i]);  //string is byte[], php  substr($string,$i,$n)  ==  java string.substring(i, i+n)
            if (j == keyLen) { j = 0; }
            ordKey = ord_java(key.substring(j,j+1));
            j++;
            hash += strrev(base_convert(dechex(ordStr + ordKey),16,36));
        }
        return hash;
    }
    // return the ASCII code of the first character of str
    public static int      ord_java( String str){
        return( (int)  str.charAt(0)  ) ;
    }
    public static String   dechex(int input  ) {
        String hex  = Integer.toHexString(input ) ;
        return hex ;
    }
    public static String   strrev(String str){
        return  new StringBuilder(str).reverse().toString() ;
    }
    public static String   base_convert(String str, int fromBase, int toBase) {
        return Integer.toString(Integer.parseInt(str, fromBase), toBase);
    }

    String  str1 = "good friend" ;
    String  str2 = "好友" ;
    String  key  = "iuyhjf476" ;
    Log.d(LogTag,"java encode str1 '"+ str1  +"'="+hash_java(str1, key)) ;
    Log.d(LogTag,"java encode str2 '"+ str2  +"'="+hash_java(str2, key)) ;

The Java output is:

    java encode str1 'good friend'=s5c6g6o5u3o5m4g4b4z516
    java encode str2 '好友'=arh4ng

The encoded output of the UTF-8 string str2 from the Java method is not correct. How can I fix the problem?

user2818066

2 Answers

0

In Java, convert the string to a byte array, using UTF-8 character encoding. Then, apply your encoding algorithm to this byte array instead of the string.

Your PHP program seems to do the same thing implicitly: it treats each character of "好友" as three individual byte values, according to the UTF-8 encoding.

EDIT:

In the comments, you say you receive the string from the user entering it on Android. So, you start with a Java String coming from some UI widget.

And you need that Java String to give the same result that the given PHP function produces when fed with the same UTF-8 string. The resulting string will only use ASCII characters, so its character encoding is less problematic (it doesn't matter whether it's e.g. ISO-8859-1 or UTF-8).

The PHP string datatype is ignorant of the encoding and just stores a sequence of bytes. In general it might contain ISO-8859-1 bytes, where one byte represents one character, or UTF-8 byte sequences, where characters often occupy multiple bytes, or any other encoding. The PHP string does not know how the bytes are meant to be interpreted as characters; it just sees and counts bytes.

So what your PHP code treats as "characters" are effectively the bytes of the UTF-8 encoding, and the Java side must emulate this behaviour in its algorithm.

Java has a String data type very different from PHP's: it is not based on byte sequences, but (mainly) sees a string as a sequence of characters. So, if you work with the characters of the Java String, you will not see the same sequence of elements that PHP sees.

When Java iterates over a String like "好友", there are two steps, one for each of the two characters (seeing the character's Unicode code point number), while PHP has six steps, one for each byte of the UTF-8 representation, seeing the byte value.
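The difference in step counts can be checked with a small sketch. It assumes the text is written with Unicode escapes (U+597D and U+53CB for the two characters) so that the result does not depend on how the source file is saved; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class StepCount {
    // The two characters of str2, written as escapes for U+597D and U+53CB
    static final String TEXT = "\u597D\u53CB";

    // Number of steps a char-by-char loop over the Java String takes
    static int javaSteps(String s) {
        return s.length();
    }

    // Number of steps a strlen()-based PHP loop takes over the UTF-8 bytes
    static int phpSteps(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(javaSteps(TEXT)); // 2
        System.out.println(phpSteps(TEXT));  // 6
        // First element seen by Java: the code point value
        System.out.println(Integer.toHexString(TEXT.charAt(0))); // 597d
        // First element seen by PHP: the first UTF-8 byte
        byte first = TEXT.getBytes(StandardCharsets.UTF_8)[0];
        System.out.println(Integer.toHexString(first & 0xFF));   // e5
    }
}
```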

So, to emulate PHP, in Java you have to convert the String to a byte[] array using UTF-8 encoding. This way, one Java byte will correspond to one PHP character.
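A minimal sketch of that approach, applied to the function from the question, might look like the following. The class name `PhpCompatHash` and method name `hashBytes` are illustrative, and the sketch assumes a non-empty key and that the PHP side really receives UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

class PhpCompatHash {
    // Byte-wise port of the question's myhash_php(); assumes key is not empty
    static String hashBytes(String string, String key) {
        byte[] str = string.getBytes(StandardCharsets.UTF_8);
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        StringBuilder hash = new StringBuilder();
        int j = 0;
        for (int i = 0; i < str.length; i++) {
            // Java bytes are signed; mask with 0xFF to get the 0..255 range of PHP's ord()
            int ordStr = str[i] & 0xFF;
            if (j == k.length) { j = 0; }
            int ordKey = k[j] & 0xFF;
            j++;
            // dechex() followed by base_convert(.., 16, 36) is a plain base-36 conversion
            String base36 = Integer.toString(ordStr + ordKey, 36);
            hash.append(new StringBuilder(base36).reverse().toString());
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        String key = "iuyhjf476";
        System.out.println(hashBytes("good friend", key));  // s5c6g6o5u3o5m4g4b4z516
        System.out.println(hashBytes("\u597D\u53CB", key)); // a9u7m899x6p6
    }
}
```

Both results match the PHP output listed in the question. The `& 0xFF` mask is the step that is easy to miss: without it, bytes above 0x7F would be treated as negative numbers.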

Remark

By the way, the wording "UTF-8 string" does not make sense in Java.

That is different from PHP where e.g. "Maß" as ISO-8859-1 string (having a length of 3) differs from "Maß" as UTF-8 string (having a length of 4).

In Java, Strings are sequences of characters, and that's the reason why e.g. "好友" has a length of 2, as it's just two characters that happen to come from a non-Latin script. [This is true for most Unicode characters you'll typically encounter, but there are exceptions.] In Java, terms like UTF-8 matter only when converting between strings and byte sequences.
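Both points of this remark can be checked with a short sketch. It assumes Unicode escapes for the non-ASCII characters (U+00DF for ß, and a surrogate pair for U+1F600 as one example of the exceptions), so the source file encoding does not matter; the class name `LengthDemo` is made up.

```java
import java.nio.charset.StandardCharsets;

class LengthDemo {
    // "Ma" followed by U+00DF (sharp s)
    static final String MASS = "Ma\u00DF";
    // U+1F600, a character outside the Basic Multilingual Plane
    static final String EMOJI = "\uD83D\uDE00";

    public static void main(String[] args) {
        // The String itself has one length, independent of any encoding
        System.out.println(MASS.length()); // 3
        // The byte count only appears once an encoding is chosen
        System.out.println(MASS.getBytes(StandardCharsets.ISO_8859_1).length); // 3
        System.out.println(MASS.getBytes(StandardCharsets.UTF_8).length);      // 4
        // Exception: one character stored as two char values
        System.out.println(EMOJI.length());                          // 2
        System.out.println(EMOJI.codePointCount(0, EMOJI.length())); // 1
    }
}
```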

Ralf Kleberhoff
  • Using [literals](https://en.wikipedia.org/wiki/Literal_(computer_programming)) is the mistake - "converting" from `String` to `byte[]` is too late already. In both PHP and Java the file's text encoding could be anything - providing bytes unrelated to any encoding is the only way to be safe. – AmigoJack Nov 12 '20 at 10:27
  • @AmigoJack In general, I agree. But how does "file's text encoding" apply to the OP's question? And the literals are surely just test cases. – Ralf Kleberhoff Nov 12 '20 at 12:04
  • I forgot to put in strrev method in the code, I add it in. Now you can easily run the code. How to fix the java codes to make it consistent with PHP codes? – user2818066 Nov 12 '20 at 12:51
  • As I said, convert the string to a byte array, using UTF-8, and then replace all the current string operations with the equivalent byte array operations. – Ralf Kleberhoff Nov 12 '20 at 13:44
  • @RalfKleberhoff We neither see if the PHP file is saved in UTF-8 at all, nor can PHP guess the encoding - it must be set somewhere, too - and since it's nowhere in the file itself the PHP file is installation dependent. For tests one has to treat text like raw bytes, instead of assuming anything - otherwise the tests are misleading in both their operation and result. – AmigoJack Nov 12 '20 at 14:05
  • Let's be practical and consider how to make the code work in a real application. No file is stored in the real application: users enter str1 or str2 on Android, and Android encodes the string and sends it to the PHP server. I need Java to encode the string in the same way as PHP does. I am not allowed to change the PHP code. The question is straightforward: how do I make the Java code encode the same way as PHP, given the test strings? – user2818066 Nov 12 '20 at 23:54
  • In java, which way shall we go? char or byte instead of String ? – user2818066 Nov 13 '20 at 01:05
  • But what java codes can do the job ? Instead of writing so much explanation that I don't quite understand. Could you please simply make the java codes do the job? If someone knows how, it shall be a simple edit of the original java codes. – user2818066 Nov 13 '20 at 12:12
  • Sorry, stackoverflow is not a coding service. – Ralf Kleberhoff Nov 13 '20 at 13:29
0

Do not use literals for testing: this is prone to yield unexpected results if you are not fully aware of what you are doing and how the file is encoded. For UTF-8 you should treat everything as raw bytes and never use a String for en/decoding. Example in PHP:

    $test1 = pack( 'H*', '414243' );  // "ABC" in hexadecimal: 2 digits per byte
    $test2 = pack( 'H*', 'e5a5bde58f8b' );  // "好友" in hexadecimal, UTF-8 encoded, 3 bytes per character

Example in Java:

    byte[] test1 = new byte[] { 0x41, 0x42, 0x43 };  // "ABC"
    byte[] test2 = new byte[] { (byte)0xe5, (byte)0xa5, (byte)0xbd, (byte)0xe5, (byte)0x8f, (byte)0x8b };  // "好友"

Only this way can you make sure your test is set up correctly and is independent of how the source file is encoded. If your Java file is encoded in UTF-8 and your PHP file is encoded in UTF-16LE, you would fail even worse, simply because so far you have not separated definition (raw bytes) from assumption (strings based on the text encoding).
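On the Java side, one way to connect such raw-byte test data to a String is to decode it with an explicitly named charset and check that it round-trips. This is only a sketch under the assumption that the bytes are meant as UTF-8; the class name `RawBytesCheck` and its helpers are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RawBytesCheck {
    // The six UTF-8 bytes of the two-character test string from the example above
    static final byte[] RAW = {
        (byte) 0xe5, (byte) 0xa5, (byte) 0xbd, (byte) 0xe5, (byte) 0x8f, (byte) 0x8b
    };

    // Decode with an explicit charset instead of relying on a platform default
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // True if decoding and re-encoding yields exactly the same bytes
    static boolean roundTrips(byte[] utf8) {
        return Arrays.equals(utf8, decode(utf8).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String s = decode(RAW);
        System.out.println(s.length());               // 2
        System.out.println(s.equals("\u597D\u53CB")); // true
        System.out.println(roundTrips(RAW));          // true
    }
}
```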

(This is also a big misunderstanding when people want to en/decrypt texts: they operate on (any programming language's) String rather than the actual bytes and then wonder why different results occur with a different programming language.)

AmigoJack
  • What? PHP is no application and (unless used strictly as a script interpreter) relies on HTTP (whose encoding you yet did not even question). Learn where one scope ends and another starts. Getting input in Java is easy - putting it into `String` leads to problems, as it can't be UTF-8 then - see https://stackoverflow.com/a/5729834/4299358 – AmigoJack Nov 13 '20 at 00:09