
I'm trying to replicate this hashing code in Python, but the two languages handle bytes differently and generate very different outputs.

Can someone guide me here?

Java Code (Original Code)

public static String hash(String filePath, String salt) {

        String finalHash = null;
        Path path = Paths.get(filePath);

        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] data = Files.readAllBytes(path);
            byte[] dataDigest = md.digest(data);
            byte[] hashDigest = md.digest(salt.getBytes("ISO-8859-1"));
            byte[] xorBytes = new byte[dataDigest.length];

            for (int i = 0; i < dataDigest.length && i < hashDigest.length; i++) {
                xorBytes[i] = (byte) (dataDigest[i] << 1 ^ hashDigest[i] >> 1);
            }

            finalHash = (new HexBinaryAdapter()).marshal(xorBytes);
        } catch (IOException | NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
        return finalHash;
    }

Python Code (Translated by me)

from hashlib import sha1

def generate_hash(file_path: str, salt: bytes) -> str:
    with open(file_path, 'rb') as f:
        data = f.read()

    hashed_file = sha1(data).digest()
    hashed_salt = sha1(salt).digest()

    xor_bytes = []

    for i in range(len(hashed_file)):
        xor_bytes.append((hashed_file[i] << 1 ^ hashed_salt[i] >> 1))

    return ''.join(map(chr, xor_bytes))  # This is probably not equivalent of HexBinaryAdapter
André Roggeri Campos

1 Answer

The Python code has the following issues:

  • The shift operations are implemented incorrectly in the Python code:

    In the Python code, the generated hash is stored in a bytes-like object, which behaves like a sequence of unsigned integer values between 0 and 255 [1], e.g. 0xc8 = 11001000 = 200. In Java, the byte type is signed, with two's complement used to represent negative numbers [2][3]. The same value 0xc8 is therefore interpreted as -56 when stored in a byte variable.

    The >> operator produces different results at the binary level for signed and unsigned values, because it is an arithmetic shift operator that preserves the sign [4][5][6]. Example:

    signed       -56 >> 1 = 1110 0100 = -28
    unsigned     200 >> 1 = 0110 0100 = 100
    

    The << operator, on the other hand, does not cause this problem, but it can produce values that no longer fit into a byte. Example:

    signed       -56 << 1 = 1 1001 0000 = -112
    unsigned     200 << 1 = 1 1001 0000 = 400
    

    For these reasons, in the Python code the following line

    xor_bytes.append((hashed_file[i] << 1 ^ hashed_salt[i] >> 1))
    

    has to be replaced by

    xor_bytes.append((hashed_file[i] << 1 ^ tc(hashed_salt[i]) >> 1) & 0xFF)
    

    where

    def tc(val):
        if val > 127:
            val = val - 256
        return val
    

    converts an unsigned byte value into the signed value given by its two's complement representation (for a more sophisticated variant using bitwise operators, see [7]).

    The bitwise AND (&) with 0xFF ensures that only the lowest byte is kept in the Python code, analogous to the (byte) cast in the Java code [5].
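
The shift behaviour described above can be checked directly in Python. This is only a minimal sketch: `tc` mirrors the helper shown above, 0xC8 is the sample value from the examples, and the variable names are made up for illustration.

```python
def tc(val: int) -> int:
    """Reinterpret an unsigned byte (0..255) as a signed two's complement byte."""
    return val - 256 if val > 127 else val


value = 0xC8  # 200 as an unsigned byte, -56 as a Java byte

# Right shift: Python works on 200 unless the sign is restored first.
unsigned_shift = value >> 1      # 100
signed_shift = tc(value) >> 1    # -28, matching Java's arithmetic shift

# Left shift: after masking, the low byte is the same either way.
left_unsigned = (value << 1) & 0xFF
left_signed = (tc(value) << 1) & 0xFF

print(unsigned_shift, signed_shift, hex(left_unsigned), hex(left_signed))
```

Only the right shift needs the signed conversion; for the left shift the mask alone is sufficient.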


  • There are several ways to convert the list/bytes-like object into a hexadecimal string (as in the Java code), e.g. with [8][9]

    bytes(xor_bytes).hex() 
    

    or with [8][10] (which returns a bytes object instead of a str)

    binascii.b2a_hex(bytes(xor_bytes))
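
As a small illustration of the two conversions, the sketch below uses made-up sample values rather than real digest output. Both Python calls return lowercase hex digits, whereas `HexBinaryAdapter.marshal` returns uppercase ones, so `.upper()` is added here on the assumption that the strings have to match exactly.

```python
import binascii

xor_bytes = [0x74, 0x0A, 0xFF]  # sample values, not real digest output

hex_lower = bytes(xor_bytes).hex()              # str, lowercase
hex_bytes = binascii.b2a_hex(bytes(xor_bytes))  # bytes, lowercase
hex_upper = hex_lower.upper()                   # str, uppercase

print(hex_lower, hex_bytes, hex_upper)
```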
    


  • In the Python code, the encoding of the salt must be taken into account. Since the salt is already passed as a bytes object (in the Java code it is passed as a string), the encoding must be performed before the function is called:

    saltStr = 'MySalt'
    salt = saltStr.encode('ISO-8859-1')
    

    For functional consistency with the Java code, the salt would have to be passed as a string and the encoding would have to be performed within the function.
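
Putting the three points together, the function could look roughly like the sketch below. It assumes the salt is passed as a string and encoded inside the function, as in the Java code. The helper name `combine_digests` is hypothetical, and the final `.upper()` is there because `HexBinaryAdapter` emits uppercase hex digits.

```python
from hashlib import sha1


def tc(val: int) -> int:
    """Reinterpret an unsigned byte (0..255) as a signed two's complement byte."""
    return val - 256 if val > 127 else val


def combine_digests(file_digest: bytes, salt_digest: bytes) -> bytes:
    """Hypothetical helper: per-byte (file << 1) ^ (salt >> 1), truncated to one byte."""
    return bytes(
        ((f << 1) ^ (tc(s) >> 1)) & 0xFF  # tc() restores the sign before the right shift
        for f, s in zip(file_digest, salt_digest)
    )


def generate_hash(file_path: str, salt: str) -> str:
    with open(file_path, "rb") as f:
        data = f.read()

    file_digest = sha1(data).digest()
    # Encode the salt inside the function, mirroring salt.getBytes("ISO-8859-1").
    salt_digest = sha1(salt.encode("ISO-8859-1")).digest()

    # Uppercase to match the hex string produced on the Java side.
    return combine_digests(file_digest, salt_digest).hex().upper()
```

Calling `generate_hash('some_file.bin', 'MySalt')` (the file name is a placeholder) should then return a 40-character hex string that can be compared with the Java output.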


Topaco
  • I was comparing the data right now and noticed the issue with bytes in java vs python. With the corrections supplied now the code returns the same output. Thank you very much !! – André Roggeri Campos Mar 07 '20 at 14:57