1

When I take the hash of an ordered set (ie numerical userids), the resulting distribution of their MD5 hashes is approximately uniformly distributed: ie. If I divide the hashes into n quantiles, low-numbered userids (10001)are equally likely to be found in a quantile of hashes as high-numbered userids (99999). On the other hand, if I do this with farm_fingerprint, the resulting buckets of hashes don't seem to be uniformly distributed: low-numbered buckets have more low-numbered userids. The documentation doesn't immediately mention the distribution properties of the hashes and I can't find it in the additional references.

I realize that a better way to uniformly distribute userids is by assigning a random number to each userid, as mentioned here; my questions is specifically regarding the distribution properties of the FARM_FINGERPRINT hash mentioned.

Below is an example query illustrating the relative skew in the quantiles:

    SELECT
      avg(n_low) as avg_n_low,
      avg(n_med) as avg_n_med,
      avg(n_high) as avg_n_high
    FROM (
      SELECT bucket_id,
        SUM(CASE WHEN label = 'low' THEN 1 ELSE 0 END) as n_low,
        SUM(CASE WHEN label = 'med' THEN 1 ELSE 0 END) as n_med,
        SUM(CASE WHEN label = 'high' THEN 1 ELSE 0 END) as n_high
      FROM (
        SELECT
          x,
          label,
          ntile(1000) OVER (ORDER by h) as bucket_id 
        FROM (
          SELECT x,
          CASE
            WHEN x BETWEEN 00001 and 20000 THEN 'low' 
            WHEN x BETWEEN 20001 and 40000 THEN 'med'
            WHEN x BETWEEN 40001 and 60000 THEN 'high' 
          END as label,
          --FARM_FINGERPRINT(CAST(x AS STRING)) h
          MD5(CAST(x AS STRING)) h
          FROM UNNEST((SELECT GENERATE_ARRAY(1,60000,10) xs)) AS x
        )
      )
      GROUP BY 1
    )
    WHERE
      bucket_id < 100
      --bucket_id > 900

I bucket 2000 'low', 'med', and 'high' users into 100 buckets. One can see that using the FARM_FINGERPRINT yields a higher variance in the bucket averages than MD5, ie. FARM_FINGERPRINT seems to have a higher avg_n_low for buckets < 100 than MD5 and higher avg_n_high for buckets > 900. ie. assignment to buckets via the hash is not as uniformly distributed as MD5.

I realize this is somewhat subjective at this point, please let me know if I'm missing something or can provide more details.

rmg
  • 1,009
  • 16
  • 31

2 Answers2

3

fingerprint functions are not in the same class as cryptographic hash functions. They are not trying to produce an output that is near random/unpredictable. In other words, it doesn't matter if the value is reversible. The requirement of a fingerprint function is to simply produce a deterministic unique hash value for every unique input (avoid collisions). Because of the simpler requirements, they should be faster than the cryptographic hashes and very useful for generating surrogate keys for large volumes of data, especially on MMP platforms. Note, if your natural key includes sensitive data, you probably don't want to go with farm_fingerprint. If you really need to generate a sequence, especially a densely packed sequence, I have used a sequence table (I had to do this to feed an id to a downstream system that could only take a 32 bit integer). Generic example (note, this is hive sql, bigquery analytic functions might be a little different):

create table entity_seq_xref
(  natural_key string
  ,sequence_id integer
);
insert into entity_seq_xref
select natural_key
    ,coalesce(last_sequence_id,0) + row_number() over (partition by 1) as sequence_id
from (select s.natural_key
            ,x.sequence_id
            ,max(x.sequence_id) over (partition by 1) as last_sequence_id
      from source_entity s
      left join entity_seq_xref x
          on x.natural_key = s.natural_key
) m
where x.sequence_id is null
;
1

Can you help us reproduce the issue?

I see a very regular distribution - at least when doing the AVG(ABS()) of the fingerprint results:

SELECT CEIL(x/10000000000), AVG(ABS(h)), COUNT(*)
FROM (
  SELECT x, FARM_FINGERPRINT(CAST(x AS STRING)) h
  FROM UNNEST((SELECT GENERATE_ARRAY(1,100000000000,100000) xs)) AS x
)
GROUP BY 1 
ORDER BY 2

enter image description here

Same with the range of the ids the question has:

SELECT CEIL(x/10000), AVG(ABS(h)), COUNT(*)
FROM (
  SELECT x, FARM_FINGERPRINT(CAST(x AS STRING)) h
  FROM UNNEST((SELECT GENERATE_ARRAY(10001,100000) xs)) AS x
)
GROUP BY 1 
ORDER BY 2

enter image description here

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • Thank you, I've provided an example above – rmg Dec 04 '17 at 15:46
  • I believe part of the issue in your examples is that the input is perfectly uniformly distributed (ie. arrays with the same number of userids in each quantile spaced equally by 1000000). If the input distribution is not uniformly distributed, you start to see skews in the hash distribution more with farm_fingerprint than you do with MD5. – rmg Dec 05 '17 at 12:54
  • Thanks for adding a query, but I still don't see the problem. What numbers do you get as a result, that might seem disturbing? – Felipe Hoffa Dec 05 '17 at 15:28
  • In my example query, I split users into three ranges: low, med, and high. With a hash, I would expect it to map low userids equally over the hash space so that the average number of low userids in the low bucket_id range is about equal to the average number of high userids in a low bucket_id range. However, this isn't the case. Low users are more likely to get mapped to low hashes, while high users are more likely to get mapped to high hashes. That's what I've attempted to show with the query in my question: taking the average for the extrema of bucket_ids illustrates this skew. – rmg Dec 05 '17 at 16:03
  • if that was happening, why would the averages I get be all in the same range? – Felipe Hoffa Dec 05 '17 at 16:27
  • I believe it's because you're taking the AVG(ABS(h)), if you take AVG(h), you'll see that the averages of the hashes vary quite a bit. ie. similar ranges of x hash to similar ranges of h. Hence, ordering by h and taking quantiles, as we've done, produces more of less of CEIL(x/10000) depending on what quantile you're in. Hashing with MD5 and taking quantiles doesn't yield this type of skew in the distribution of hashes. I hope I've done a decent job of explaining what I mean. Thanks for taking the time to respond – rmg Dec 05 '17 at 17:40
  • Good observation. I'm doing AVG(ABS) because I'm assuming than an AVG of 64bit numbers won't work well if they are supposed to live in an unsigned space. See https://stackoverflow.com/questions/3816446/how-can-i-safely-average-two-unsigned-ints-in-c – Felipe Hoffa Dec 05 '17 at 23:27
  • As an alternative, get rid of the signed bit, and the average will again look good: AVG(h>>1) – Felipe Hoffa Dec 05 '17 at 23:38
  • Last experiment before I leave: Try this to see what's happening in this 64 bit space SELECT 1<<0, 1<<1, 1<<10, 1<<61, 1<<62, 1<<63, 1<<64 – Felipe Hoffa Dec 05 '17 at 23:43
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/160600/discussion-between-rmg-and-felipe-hoffa). – rmg Dec 06 '17 at 11:22