1

Taking the MD5 of a string as a 128-bit representation of an integer x, how do I compute x % y in Google Bigquery, where y will typically be relatively small (approx 1000)?

Bigquery has an MD5 function, returning a result of type BYTES with 16 bytes (i.e. 128 bits).

(Background: this is to compute deterministic pseudo random numbers. However, for legacy and compatibility reasons, I have no flexibility in the algorithm! Even though we know it has a (very slight) bias.)

This needs to be done millions/billions of times every day for different input strings and diffierent moduluses, so hopefully it can be done efficiently. As a fall back, I can compute it externally with another language, and then upload to Bigquery afterwards; but it would be great if I could do this directly in Bigquery.

I have studied a lot of number theory, so maybe we can use some mathematical tricks. However, I'm still stuck on more basic BiqQuery issues

  • How do I convert a bytes array to some sort of "big integer" type?
  • Can I access a subrange of the bytes from a BYTES array?
  • Given one byte (or maybe four bytes?), can I convert it to an integer type on which I can apply arithmetic operations?
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
Aaron McDaid
  • 26,501
  • 9
  • 66
  • 88

1 Answers1

1

With the power of math and a longish SQL function:

CREATE TEMP FUNCTION modulo_md5(str ANY TYPE, m ANY TYPE) AS ((
  SELECT MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(0 
    * 256 + num[OFFSET(0)], m ) 
    * 256 + num[OFFSET(1)], m )  
    * 256 + num[OFFSET(2)], m ) 
    * 256 + num[OFFSET(3)], m ) 
    * 256 + num[OFFSET(4)], m )  
    * 256 + num[OFFSET(5)], m ) 
    * 256 + num[OFFSET(6)], m ) 
    * 256 + num[OFFSET(7)], m )  
    * 256 + num[OFFSET(8)], m ) 
    * 256 + num[OFFSET(9)], m ) 
    * 256 + num[OFFSET(10)], m )  
    * 256 + num[OFFSET(11)], m ) 
    * 256 + num[OFFSET(12)], m ) 
    * 256 + num[OFFSET(13)], m )  
    * 256 + num[OFFSET(14)], m ) 
    * 256 + num[OFFSET(15)], m ) 
  FROM (SELECT TO_CODE_POINTS(MD5(str)) num)
));


SELECT title, modulo_md5(title, 177) result, TO_HEX(MD5(title)) md5
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
WHERE wiki='en'
LIMIT 100000

enter image description here

And now you can use it as a persistent shared UDF:

SELECT fhoffa.x.modulo_md5("any string", 177) result
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • 1
    Worked great thanks. And I'm surprised how fast it is, computing millions of entries per minute! I took a slightly different approach, using `to_code_points(md5(the_string))` to get directly to the array of integers. – Aaron McDaid Dec 23 '19 at 16:59