12

I want to use the MD5 message digest of some string as the primary key of a table. What datatype should I use for such a field? What select and insert statements should I write for the field?

Brad Koch
  • 19,267
  • 19
  • 110
  • 137
象嘉道
  • 3,657
  • 5
  • 33
  • 49

2 Answers2

14

The md5 hash as bytea will use only 16 bytes in instead of 32 for the hexa representation:

create table t (d bytea);
insert into t (d) values
    (digest('my_string', 'md5')),
    (decode(md5('my_string'), 'hex'));

Both forms above will work but to use the simpler digest function it is necessary to install the pgcrypto extension as superuser:

create extension pgcrypto;

Use the digest function or the combination of decode and md5 as above to search for a certain string:

select
    octet_length(d) ba_length,
    pg_column_size(d) ba_column,
    encode(d, 'hex') hex_representation,
    octet_length(encode(d, 'hex')) h_length,
    pg_column_size(encode(d, 'hex')) h_column
from t
where d = digest('my_string', 'md5')
;
 ba_length | ba_column |        hex_representation        | h_length | h_column 
-----------+-----------+----------------------------------+----------+----------
        16 |        17 | 3d212b21fad7bed63c1fb560c6a5c5d0 |       32 |       36
        16 |        17 | 3d212b21fad7bed63c1fb560c6a5c5d0 |       32 |       36

The pg_column_size value is the storage size. It is less than half for the bytea compared to the hexa representation.

Simon B.
  • 2,530
  • 24
  • 30
Clodoaldo Neto
  • 118,695
  • 26
  • 233
  • 260
3

bytea has a one byte overhead, but with padding to eight bytes this will result in significant wastage.

Instead, consider using the uuid type, which uses just 16 bytes. You'll have to use something like REPLACE(md5::text, '-', '') as md5 when selecting it, but that should be a quick operation.

GreenReaper
  • 1,121
  • 1
  • 15
  • 23
  • Do you have a reference for 'padding to 8 bytes'? All [the docs say](http://www.postgresql.org/docs/9.4/static/datatype-binary.html) is that storage size = "1 or 4 bytes plus the actual binary string" –  Nov 15 '14 at 06:54
  • Two issues: a) you are storing 1+16=17 bytes, the subsequent column will be padded per typalign in [pg_type](http://www.postgresql.org/docs/9.0/static/catalog-pg-type.html), b) see [this question](http://stackoverflow.com/questions/2966524/calculating-and-saving-space-in-postgresql) regarding MAXALIGN and the [page layout reference](http://www.postgresql.org/docs/9.4/static/storage-page-layout.html) for *row* alignment. ["The actual user data (columns of the row) begins at the offset indicated by t_hoff, which must always be a multiple of the MAXALIGN distance for the platform." == 8 on x64]. – GreenReaper Jan 21 '15 at 12:06
  • In this specific case, row space usage would likely be 23+1 (row header + null header for up to 8 columns) + 4 (ba_length) + 4 (ba_column) + 1+16 (hex_representation) + 3 (alignment padding for h_length) + 4 (h_length) + 4 (h_column) = 60 + 4 (row padding). That's how you get 1 + 3 + 4 = 8 bytes more than using uuid. Bottom line: if you care about space, you have to care about your row layout. Usually putting large fields first is close to optimal, although if it's larger than 8 bytes you might reconsider. – GreenReaper Jan 21 '15 at 12:21
  • Interesting - though the numbers would be different depending on the `typalign` value of the next column in the table, right? –  Jan 21 '15 at 13:43
  • Yes, potentially. A small change could have a big impact, or none at all. Another thing to consider is that nullable fields may take up no space in a particular row instance, so if you are laying out a table for maximum space savings you need to consider the layout with and without them. (Of course, if there are more than eight columns and one is null, the null bitmap no longer takes up a "free" byte at the end of the 23-byte header but increases the row size by MAXALIGN due to user data alignment - potentially causing a significant and confusing increase in table size upon adding a column.) – GreenReaper Jan 22 '15 at 14:37
  • Null can result in significant savings if you're willing to abuse it. [In one case](https://inkbunny.net/journalview.php?id=121417) I saved eight bytes on 70% of rows in a 12-million row table by using coalesce() to allow storage of null values as a replacement for the most popular value. Obviously this is only something to even consider when you will realize significant savings and have full control over the application. – GreenReaper Jan 22 '15 at 15:18
  • Thanks I'm grateful to be introduced to these issues which I'd never considered before. You might find some of the questions on DBA.SE interesting, like [this old one](http://dba.stackexchange.com/q/8803/1396) for example. –  Jan 22 '15 at 17:38