2

I am currently trying to calculate the Hamming distance between two binary strings in BigQuery using User defined functions in Javascript, my schema is quite simple:

row_id        STRING    
descriptors   BYTES REPEATED    
phash         BYTES

What I am finding a bit confusing is the fact that you apparently deal with BYTES in BigQuery as a Base64 string, I imported both functions atob() and btoa() so I would be able to work with the binary form of the byte strings instead of the Base64 representation:

My Query currently looks like this:

CREATE TEMP FUNCTION f_PHASH_distance(ph1 BYTES, ph2 BYTES)
    RETURNS INT64
   LANGUAGE js AS
   """
        return HammingDistance(ph1, ph2);
   """
    OPTIONS (
        library=["gs://test.appspot.com/HammingDistance.js",
                 "gs://test.appspot.com/btoa_atob.js"]
    );

SELECT f_PHASH_distance(phash, CAST("9Slp3g9OgVI=" AS BYTES)) 
  FROM ims.images WHERE row_id = "2333USX"

And the row with id = "2333USX" phash is equal to "9Slp3g9OgVI=" in base64, which means that the Hamming distance is 0. But instead of 0 I am currently getting is 35 on BigQuery.

HammingDistance.js has the following content:

function HammingDistance(a, b){
    var count = 0;
    for(var i = 0; i < a.length; i++){
        // calculate XOR between the two chars
        var xor = a.charCodeAt(i) ^ b.charCodeAt(i);
        // count number of 1's on the result
        for(var j = 0; j < 16; j++){
            //add if LSB is 1
            count += xor % 2;
            //right shift the variable
            xor = xor >> 1;
        }
    }
    return count;
}

/**
 *  Calculates the distance between two Perceptual hashes of two images encoded
 *  in base 64
 */
function PHASHDistance(a, b){
    return HammingDistance(atob(a), atob(b));
} 

And testing it in the JS console of my browser I do get the expected result. So I assume that I am doing something wrong with the casts but the documentation is very scarce on UDFs with BYTE parameters.

Any help would be much appreciated.

clds
  • 171
  • 2
  • 13

2 Answers2

3

It looks like the problem is that you are casting "9Slp3g9OgVI=" to bytes rather than converting it to bytes from base64. I think you want this instead:

SELECT f_PHASH_distance(phash, FROM_BASE64("9Slp3g9OgVI=")) 
FROM ims.images WHERE row_id = "2333USX"

You might be better off using SQL functions rather than JavaScript functions, though, since JavaScript normally isn't as fast. Here's a Hamming distance implementation in SQL, assuming that the bytes have equal lengths:

#standardSQL
CREATE TEMP FUNCTION HammingDistance(b1 BYTES, b2 BYTES) AS (
  BIT_COUNT(b1 ^ b2)
);

WITH Input AS (
  SELECT b'defdef' AS bytes UNION ALL
  SELECT b'123de4' UNION ALL
  SELECT b'abc123'
)
SELECT HammingDistance(b'abcdef', bytes)
FROM Input;

It takes the bitwise XOR of the two byte values, then checks how many bits are not the same.

Elliott Brossard
  • 32,095
  • 2
  • 67
  • 99
  • Elliot, it seems like there is a small problems with the SQL HammingDistance, it fails for the following example: `CREATE TEMP FUNCTION HammingDistance(b1 BYTES, b2 BYTES) AS ( (SELECT COUNTIF(c != 0) FROM UNNEST(TO_CODE_POINTS(b1 ^ b2)) AS c) ); SELECT HammingDistance(FROM_BASE64("MA=="), FROM_BASE64("Mw=="))` "MA==" being the base 64 representation of 0b110000 and "Mw==" the base 64 representation of 0b110011. It looks like this behaviour comes from the fact that TO_CODE_POINTS returns a array of chars and not an array of bits. – clds Jul 12 '17 at 13:33
  • Using `BIT_COUNT(b1 ^ b2)` seems to solve the problem. – clds Jul 12 '17 at 14:07
  • Sorry, I thought the intention was to get a signal of whether each pair of bytes matched. I updated my answer to use BIT_COUNT with xor instead. – Elliott Brossard Jul 12 '17 at 15:27
-1

In case someone is looking for a solution in the case of comparing regular strings (not binary ones as this question), look at my answer here

jquinter
  • 144
  • 4