55

I need to Convert Strings stored in my Database to a Numeric value. Result can be Integer (preferred) or Bigint. This conversion is to be done at Database side in a PL/pgSQL function.

Can someone please point me to some algorithm or any API's that can be used to achieve this?

I have been searching for this on Google for hours now, could not find anything useful so far :(

Salman A. Kagzi
  • 3,833
  • 13
  • 45
  • 64

5 Answers5

73

Just keep the first 32 bits or 64 bits of the MD5 hash. Of course, it voids the main property of md5 (=the probability of collision being infinitesimal) but you'll still get a wide dispersion of values which presumably is good enough for your problem.

SQL functions derived from the other answers:

For bigint:

create function h_bigint(text) returns bigint as $$
 select ('x'||substr(md5($1),1,16))::bit(64)::bigint;
$$ language sql;

For int:

create function h_int(text) returns int as $$
 select ('x'||substr(md5($1),1,8))::bit(32)::int;
$$ language sql;
Daniel Vérité
  • 58,074
  • 15
  • 129
  • 156
  • 3
    What does the `('x'||` conditional do? – Michael Yoo Dec 05 '16 at 21:47
  • 2
    @MichaelYoo: the `||` operator is for string concatenation (only MySQL and programming languages inspired by C interpret `||` as a logical OR, it is not the standard interpretation in SQL). – Daniel Vérité Dec 06 '16 at 11:55
  • 4
    I see. I think it's just to "unhex" the hex representation in to binary, which is then cast to bits, is that right? (because Postgres doesn't have unhex?) The examples I see always put the 'x' outside of quotes like x'1b1c2f', how does prepending a string literal 'x' work in this case? i.e. would CONCAT() also work? – Michael Yoo Dec 08 '16 at 14:28
17

You can create a md5 hash value without problems:

select md5('hello, world');

This returns a string with a hex number.

Unfortunately there is no built-in function to convert hex to integer but as you are doing that in PL/pgSQL anyway, this might help:

https://stackoverflow.com/a/8316731/330315

Community
  • 1
  • 1
16

PostgreSQL has hash functions for many column types. You can use hashtext if you want integer hash values, or hashtextextended if you prefer bigint hash values.

Note that hashXXXextended functions take one extra parameter for seed and 0 means that no seed will be used.

In the commit message the author Robert Haas says:

Just in case somebody wants a 64-bit hash value that is compatible with the existing 32-bit hash values, make the low 32-bits of the 64-bit hash value match the 32-bit hash value when the seed is 0.

Example

postgres=# SELECT hashtextextended('test string of type text', 0);
   hashtextextended
----------------------
 -6578719834206879717
(1 row)

postgres=# SELECT hashtext('test string of type text');
  hashtext
-------------
 -1790427109
(1 row)

What about other data types?

You can check all available hash functions on your backend by checking the output for \df hash*. Below you can see the functions available in PG 14.0.

hanefi=# \df hash*
                                     List of functions
   Schema   |           Name           | Result data type |   Argument data types    | Type
------------+--------------------------+------------------+--------------------------+------
 pg_catalog | hash_aclitem             | integer          | aclitem                  | func
 pg_catalog | hash_aclitem_extended    | bigint           | aclitem, bigint          | func
 pg_catalog | hash_array               | integer          | anyarray                 | func
 pg_catalog | hash_array_extended      | bigint           | anyarray, bigint         | func
 pg_catalog | hash_multirange          | integer          | anymultirange            | func
 pg_catalog | hash_multirange_extended | bigint           | anymultirange, bigint    | func
 pg_catalog | hash_numeric             | integer          | numeric                  | func
 pg_catalog | hash_numeric_extended    | bigint           | numeric, bigint          | func
 pg_catalog | hash_range               | integer          | anyrange                 | func
 pg_catalog | hash_range_extended      | bigint           | anyrange, bigint         | func
 pg_catalog | hash_record              | integer          | record                   | func
 pg_catalog | hash_record_extended     | bigint           | record, bigint           | func
 pg_catalog | hashbpchar               | integer          | character                | func
 pg_catalog | hashbpcharextended       | bigint           | character, bigint        | func
 pg_catalog | hashchar                 | integer          | "char"                   | func
 pg_catalog | hashcharextended         | bigint           | "char", bigint           | func
 pg_catalog | hashenum                 | integer          | anyenum                  | func
 pg_catalog | hashenumextended         | bigint           | anyenum, bigint          | func
 pg_catalog | hashfloat4               | integer          | real                     | func
 pg_catalog | hashfloat4extended       | bigint           | real, bigint             | func
 pg_catalog | hashfloat8               | integer          | double precision         | func
 pg_catalog | hashfloat8extended       | bigint           | double precision, bigint | func
 pg_catalog | hashhandler              | index_am_handler | internal                 | func
 pg_catalog | hashinet                 | integer          | inet                     | func
 pg_catalog | hashinetextended         | bigint           | inet, bigint             | func
 pg_catalog | hashint2                 | integer          | smallint                 | func
 pg_catalog | hashint2extended         | bigint           | smallint, bigint         | func
 pg_catalog | hashint4                 | integer          | integer                  | func
 pg_catalog | hashint4extended         | bigint           | integer, bigint          | func
 pg_catalog | hashint8                 | integer          | bigint                   | func
 pg_catalog | hashint8extended         | bigint           | bigint, bigint           | func
 pg_catalog | hashmacaddr              | integer          | macaddr                  | func
 pg_catalog | hashmacaddr8             | integer          | macaddr8                 | func
 pg_catalog | hashmacaddr8extended     | bigint           | macaddr8, bigint         | func
 pg_catalog | hashmacaddrextended      | bigint           | macaddr, bigint          | func
 pg_catalog | hashname                 | integer          | name                     | func
 pg_catalog | hashnameextended         | bigint           | name, bigint             | func
 pg_catalog | hashoid                  | integer          | oid                      | func
 pg_catalog | hashoidextended          | bigint           | oid, bigint              | func
 pg_catalog | hashoidvector            | integer          | oidvector                | func
 pg_catalog | hashoidvectorextended    | bigint           | oidvector, bigint        | func
 pg_catalog | hashtext                 | integer          | text                     | func
 pg_catalog | hashtextextended         | bigint           | text, bigint             | func
 pg_catalog | hashtid                  | integer          | tid                      | func
 pg_catalog | hashtidextended          | bigint           | tid, bigint              | func
 pg_catalog | hashvarlena              | integer          | internal                 | func
 pg_catalog | hashvarlenaextended      | bigint           | internal, bigint         | func
(47 rows)

Caveats

If you want to have consistent hashes across different systems, make sure you have the same collation behaviour.

The built-in collatable data types are text, varchar, and char. If you have different collation options, you can see different hash values. For example the sort order of the strings ‘a-a’ and ‘a+a’ flipped in glibc 2.28 (Debian 10, RHEL 8) compared to earlier releases.

If you wish to use the hashes on the same machine, you need not worry, as long as you do not update glibc or use a different collation.

See more details at: https://www.citusdata.com/blog/2020/12/12/dont-let-collation-versions-corrupt-your-postgresql-indexes/

  • 1
    If you want to combine several values into a single hash, you can chain them by making one hash the seed for the next: `hashfloat8extended(1.0, hashfloat8(2.0))` – hfs Jul 15 '22 at 17:33
  • 2
    according to this discussion https://www.postgresql.org/message-id/CABUevExTx2whgSpKaoMVowDxBe%3Dpm7w4LJkb%3D-k8NTohQT12Kg%40mail.gmail.com hash**** family of functions are for internal use and might change in future versions producing different hashes for the same input. imo its risky to rely on them in a production system. – IulianT Oct 21 '22 at 11:34
5

Must it be an integer? The pg_crypto module provides a number of standard hash functions (md5, sha1, etc). They all return bytea. I suppose you could throw away some bits and convert bytea to integer.

bigint is too small to store a cryptographic hash. The largest non-bytea binary type Pg supports is uuid. You could cast a digest to uuid like this:

select ('{'||encode( substring(digest('foobar','sha256') from 1 for 16), 'hex')||'}')::uuid;
                 uuid                 
--------------------------------------
 c3ab8ff1-3720-e8ad-9047-dd39466b3c89
dbenhur
  • 20,008
  • 4
  • 48
  • 45
3

This is an implementation of Java's String.hashCode():

CREATE OR REPLACE FUNCTION hashCode(_string text) RETURNS INTEGER AS $$
DECLARE
  val_ CHAR[];
  h_ INTEGER := 0;
  ascii_ INTEGER;
  c_ char;
BEGIN
  val_ = regexp_split_to_array(_string, '');

  FOR i in 1 .. array_length(val_, 1)
  LOOP
    c_ := (val_)[i];
    ascii_ := ascii(c_);
    h_ = 31 * h_ + ascii_;
    raise info '%: % = %', i, c_, h_;
  END LOOP;
RETURN h_;
END;
$$ LANGUAGE plpgsql;
dvlcube
  • 1,117
  • 1
  • 12
  • 20