I am trying to calculate the Hamming distance between two binary strings in BigQuery using user-defined functions (UDFs) in JavaScript. My schema is quite simple:
row_id STRING
descriptors BYTES REPEATED
phash BYTES
What I find a bit confusing is that BigQuery apparently hands BYTES to JavaScript as a Base64 string. I imported both atob()
and btoa()
so that I could work with the binary form of the byte strings instead of the Base64 representation.
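As a sanity check in the browser console, decoding the sample hash used below does give back the 8 raw bytes I would expect:

// atob() turns the Base64 text into a "binary string" of raw bytes
var raw = atob("9Slp3g9OgVI=");
raw.length;        // 8 -> the 64-bit pHash
raw.charCodeAt(0); // 245 (0xF5), the first raw byte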
My query currently looks like this:
CREATE TEMP FUNCTION f_PHASH_distance(ph1 BYTES, ph2 BYTES)
RETURNS INT64
LANGUAGE js AS
"""
return HammingDistance(ph1, ph2);
"""
OPTIONS (
  library=["gs://test.appspot.com/HammingDistance.js",
           "gs://test.appspot.com/btoa_atob.js"]
);
SELECT f_PHASH_distance(phash, CAST("9Slp3g9OgVI=" AS BYTES))
FROM ims.images WHERE row_id = "2333USX"
The phash of the row with row_id = "2333USX" is equal to "9Slp3g9OgVI=" in Base64, which means the Hamming distance should be 0. But instead of 0 I am currently getting 35 on BigQuery.
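My suspicion (just a guess) is that the UDF never sees raw bytes: BigQuery seems to pass BYTES into JavaScript as Base64 text, and CAST("9Slp3g9OgVI=" AS BYTES) presumably encodes the UTF-8 characters of the literal rather than decoding the Base64. If that is right, my HammingDistance function (listed below) would be comparing the two encoded strings, and I can in fact reproduce the 35 in the browser console:

// Hypothesis: phash arrives as its Base64 text, while the CAST argument
// arrives as the Base64 of the literal's UTF-8 bytes
var fromTable = "9Slp3g9OgVI=";
var fromCast = btoa("9Slp3g9OgVI="); // "OVNscDNnOU9nVkk9"
HammingDistance(fromTable, fromCast); // 35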
HammingDistance.js has the following content:
function HammingDistance(a, b){
  var count = 0;
  for(var i = 0; i < a.length; i++){
    // calculate XOR between the two chars
    var xor = a.charCodeAt(i) ^ b.charCodeAt(i);
    // count number of 1's on the result
    for(var j = 0; j < 16; j++){
      // add if LSB is 1
      count += xor % 2;
      // right shift the variable
      xor = xor >> 1;
    }
  }
  return count;
}
/**
 * Calculates the distance between two perceptual hashes of two images
 * encoded in Base64.
 */
function PHASHDistance(a, b){
  return HammingDistance(atob(a), atob(b));
}
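For example, running the wrapper in the console with the sample hash from the query gives the expected distance of 0 for identical inputs:

PHASHDistance("9Slp3g9OgVI=", "9Slp3g9OgVI="); // 0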
Testing it in the JS console of my browser, I do get the expected result, as shown above. So I assume that I am doing something wrong with the casts, but the documentation is very scarce on UDFs with BYTES parameters.
Any help would be much appreciated.