Hashing a String to a Numeric Value in PostgreSQL

Question

I need to Convert Strings stored in my Database to a Numeric value. Result can be Integer (preferred) or Bigint. This conversion is to be done at Database side in a PL/pgSQL function.

Can someone please point me to some algorithm or any API's that can be used to achieve this?

I have been searching for this on Google for hours now, could not find anything useful so far :(

No. I mean strings with Characters and Symbols. What I am looking for a good hashing algorithm that has very low Collision and can generate a hash in Int or Bigint data type. — Salman A. Kagzi, Mar 21 '12 at 17:49

score 73 · Accepted Answer · answered Mar 21 '12 at 20:06

73

Just keep the first 32 bits or 64 bits of the MD5 hash. Of course, it voids the main property of md5 (=the probability of collision being infinitesimal) but you'll still get a wide dispersion of values which presumably is good enough for your problem.

SQL functions derived from the other answers:

For bigint:

create function h_bigint(text) returns bigint as $$
 select ('x'||substr(md5($1),1,16))::bit(64)::bigint;
$$ language sql;

For int:

create function h_int(text) returns int as $$
 select ('x'||substr(md5($1),1,8))::bit(32)::int;
$$ language sql;

answered Mar 21 '12 at 20:06

Daniel Vérité

58,074
15
129
156

3

What does the `('x'||` conditional do? – Michael Yoo Dec 05 '16 at 21:47
2

@MichaelYoo: the `||` operator is for string concatenation (only MySQL and programming languages inspired by C interpret `||` as a logical OR, it is not the standard interpretation in SQL). – Daniel Vérité Dec 06 '16 at 11:55
4

I see. I think it's just to "unhex" the hex representation in to binary, which is then cast to bits, is that right? (because Postgres doesn't have unhex?) The examples I see always put the 'x' outside of quotes like x'1b1c2f', how does prepending a string literal 'x' work in this case? i.e. would CONCAT() also work? – Michael Yoo Dec 08 '16 at 14:28

score 17 · Answer 2 · edited May 23 '17 at 12:34

17

You can create a md5 hash value without problems:

select md5('hello, world');

This returns a string with a hex number.

Unfortunately there is no built-in function to convert hex to integer but as you are doing that in PL/pgSQL anyway, this might help:

https://stackoverflow.com/a/8316731/330315

edited May 23 '17 at 12:34

Community

1
1

answered Mar 21 '12 at 17:57

1

Output of md5 will have substringed to avoid Integer Out of range error. – Salman A. Kagzi Mar 22 '12 at 06:22

score 16 · Answer 3 · answered Oct 20 '21 at 17:58

PostgreSQL has hash functions for many column types. You can use hashtext if you want integer hash values, or hashtextextended if you prefer bigint hash values.

Note that hashXXXextended functions take one extra parameter for seed and 0 means that no seed will be used.

In the commit message the author Robert Haas says:

Just in case somebody wants a 64-bit hash value that is compatible with the existing 32-bit hash values, make the low 32-bits of the 64-bit hash value match the 32-bit hash value when the seed is 0.

Example

postgres=# SELECT hashtextextended('test string of type text', 0);
   hashtextextended
----------------------
 -6578719834206879717
(1 row)

postgres=# SELECT hashtext('test string of type text');
  hashtext
-------------
 -1790427109
(1 row)

What about other data types?

You can check all available hash functions on your backend by checking the output for \df hash*. Below you can see the functions available in PG 14.0.

hanefi=# \df hash*
                                     List of functions
   Schema   |           Name           | Result data type |   Argument data types    | Type
------------+--------------------------+------------------+--------------------------+------
 pg_catalog | hash_aclitem             | integer          | aclitem                  | func
 pg_catalog | hash_aclitem_extended    | bigint           | aclitem, bigint          | func
 pg_catalog | hash_array               | integer          | anyarray                 | func
 pg_catalog | hash_array_extended      | bigint           | anyarray, bigint         | func
 pg_catalog | hash_multirange          | integer          | anymultirange            | func
 pg_catalog | hash_multirange_extended | bigint           | anymultirange, bigint    | func
 pg_catalog | hash_numeric             | integer          | numeric                  | func
 pg_catalog | hash_numeric_extended    | bigint           | numeric, bigint          | func
 pg_catalog | hash_range               | integer          | anyrange                 | func
 pg_catalog | hash_range_extended      | bigint           | anyrange, bigint         | func
 pg_catalog | hash_record              | integer          | record                   | func
 pg_catalog | hash_record_extended     | bigint           | record, bigint           | func
 pg_catalog | hashbpchar               | integer          | character                | func
 pg_catalog | hashbpcharextended       | bigint           | character, bigint        | func
 pg_catalog | hashchar                 | integer          | "char"                   | func
 pg_catalog | hashcharextended         | bigint           | "char", bigint           | func
 pg_catalog | hashenum                 | integer          | anyenum                  | func
 pg_catalog | hashenumextended         | bigint           | anyenum, bigint          | func
 pg_catalog | hashfloat4               | integer          | real                     | func
 pg_catalog | hashfloat4extended       | bigint           | real, bigint             | func
 pg_catalog | hashfloat8               | integer          | double precision         | func
 pg_catalog | hashfloat8extended       | bigint           | double precision, bigint | func
 pg_catalog | hashhandler              | index_am_handler | internal                 | func
 pg_catalog | hashinet                 | integer          | inet                     | func
 pg_catalog | hashinetextended         | bigint           | inet, bigint             | func
 pg_catalog | hashint2                 | integer          | smallint                 | func
 pg_catalog | hashint2extended         | bigint           | smallint, bigint         | func
 pg_catalog | hashint4                 | integer          | integer                  | func
 pg_catalog | hashint4extended         | bigint           | integer, bigint          | func
 pg_catalog | hashint8                 | integer          | bigint                   | func
 pg_catalog | hashint8extended         | bigint           | bigint, bigint           | func
 pg_catalog | hashmacaddr              | integer          | macaddr                  | func
 pg_catalog | hashmacaddr8             | integer          | macaddr8                 | func
 pg_catalog | hashmacaddr8extended     | bigint           | macaddr8, bigint         | func
 pg_catalog | hashmacaddrextended      | bigint           | macaddr, bigint          | func
 pg_catalog | hashname                 | integer          | name                     | func
 pg_catalog | hashnameextended         | bigint           | name, bigint             | func
 pg_catalog | hashoid                  | integer          | oid                      | func
 pg_catalog | hashoidextended          | bigint           | oid, bigint              | func
 pg_catalog | hashoidvector            | integer          | oidvector                | func
 pg_catalog | hashoidvectorextended    | bigint           | oidvector, bigint        | func
 pg_catalog | hashtext                 | integer          | text                     | func
 pg_catalog | hashtextextended         | bigint           | text, bigint             | func
 pg_catalog | hashtid                  | integer          | tid                      | func
 pg_catalog | hashtidextended          | bigint           | tid, bigint              | func
 pg_catalog | hashvarlena              | integer          | internal                 | func
 pg_catalog | hashvarlenaextended      | bigint           | internal, bigint         | func
(47 rows)

Caveats

If you want to have consistent hashes across different systems, make sure you have the same collation behaviour.

The built-in collatable data types are text, varchar, and char. If you have different collation options, you can see different hash values. For example the sort order of the strings ‘a-a’ and ‘a+a’ flipped in glibc 2.28 (Debian 10, RHEL 8) compared to earlier releases.

If you wish to use the hashes on the same machine, you need not worry, as long as you do not update glibc or use a different collation.

See more details at: https://www.citusdata.com/blog/2020/12/12/dont-let-collation-versions-corrupt-your-postgresql-indexes/

If you want to combine several values into a single hash, you can chain them by making one hash the seed for the next: `hashfloat8extended(1.0, hashfloat8(2.0))` — hfs, Jul 15 '22 at 17:33
according to this discussion https://www.postgresql.org/message-id/CABUevExTx2whgSpKaoMVowDxBe%3Dpm7w4LJkb%3D-k8NTohQT12Kg%40mail.gmail.com hash**** family of functions are for internal use and might change in future versions producing different hashes for the same input. imo its risky to rely on them in a production system. — IulianT, Oct 21 '22 at 11:34

dbenhur · Answer 4 · 2012-03-21T18:53:55.793

Must it be an integer? The pg_crypto module provides a number of standard hash functions (md5, sha1, etc). They all return bytea. I suppose you could throw away some bits and convert bytea to integer.

bigint is too small to store a cryptographic hash. The largest non-bytea binary type Pg supports is uuid. You could cast a digest to uuid like this:

select ('{'||encode( substring(digest('foobar','sha256') from 1 for 16), 'hex')||'}')::uuid;
                 uuid                 
--------------------------------------
 c3ab8ff1-3720-e8ad-9047-dd39466b3c89

score 3 · Answer 5 · answered Oct 25 '19 at 21:13

This is an implementation of Java's String.hashCode():

CREATE OR REPLACE FUNCTION hashCode(_string text) RETURNS INTEGER AS $$
DECLARE
  val_ CHAR[];
  h_ INTEGER := 0;
  ascii_ INTEGER;
  c_ char;
BEGIN
  val_ = regexp_split_to_array(_string, '');

  FOR i in 1 .. array_length(val_, 1)
  LOOP
    c_ := (val_)[i];
    ascii_ := ascii(c_);
    h_ = 31 * h_ + ascii_;
    raise info '%: % = %', i, c_, h_;
  END LOOP;
RETURN h_;
END;
$$ LANGUAGE plpgsql;

Hashing a String to a Numeric Value in PostgreSQL

5 Answers5

Example

What about other data types?

Caveats

Linked