When I take the hash of an ordered set (ie numerical userids), the resulting distribution of their MD5 hashes is approximately uniformly distributed: ie. If I divide the hashes into n quantiles, low-numbered userids (10001)are equally likely to be found in a quantile of hashes as high-numbered userids (99999). On the other hand, if I do this with farm_fingerprint, the resulting buckets of hashes don't seem to be uniformly distributed: low-numbered buckets have more low-numbered userids. The documentation doesn't immediately mention the distribution properties of the hashes and I can't find it in the additional references.
I realize that a better way to uniformly distribute userids is by assigning a random number to each userid, as mentioned here; my questions is specifically regarding the distribution properties of the FARM_FINGERPRINT hash mentioned.
Below is an example query illustrating the relative skew in the quantiles:
SELECT
avg(n_low) as avg_n_low,
avg(n_med) as avg_n_med,
avg(n_high) as avg_n_high
FROM (
SELECT bucket_id,
SUM(CASE WHEN label = 'low' THEN 1 ELSE 0 END) as n_low,
SUM(CASE WHEN label = 'med' THEN 1 ELSE 0 END) as n_med,
SUM(CASE WHEN label = 'high' THEN 1 ELSE 0 END) as n_high
FROM (
SELECT
x,
label,
ntile(1000) OVER (ORDER by h) as bucket_id
FROM (
SELECT x,
CASE
WHEN x BETWEEN 00001 and 20000 THEN 'low'
WHEN x BETWEEN 20001 and 40000 THEN 'med'
WHEN x BETWEEN 40001 and 60000 THEN 'high'
END as label,
--FARM_FINGERPRINT(CAST(x AS STRING)) h
MD5(CAST(x AS STRING)) h
FROM UNNEST((SELECT GENERATE_ARRAY(1,60000,10) xs)) AS x
)
)
GROUP BY 1
)
WHERE
bucket_id < 100
--bucket_id > 900
I bucket 2000 'low', 'med', and 'high' users into 100 buckets. One can see that using the FARM_FINGERPRINT
yields a higher variance in the bucket averages than MD5
, ie. FARM_FINGERPRINT
seems to have a higher avg_n_low for buckets < 100 than MD5 and higher avg_n_high for buckets > 900. ie. assignment to buckets via the hash is not as uniformly distributed as MD5
.
I realize this is somewhat subjective at this point, please let me know if I'm missing something or can provide more details.