17

The PostgreSQL types bytea and bit varying sound similar:

The documentation does not mention a maximum size for either. Is it 1GB like character varying?

I have two separate use cases, both over a table with millions of rows:

Storing MD5 hashes

That would be a bytea with a length of 16 bytes or a bit(128). It would be used for:

  • Deduplication: Heavy use of GROUP BY, with an index I suppose.
  • Querying with WHERE md5 = for exact matches only.
  • Displaying as a hex string for human use.

Storing arbitrary binary data

Strings of binary data of varying length up to 4kB for:

  • Bitwise operations to find the strings matching a certain mask. Example at the end of this post.
  • Extracting some bytes, for instance get the integer value of the byte 14 in my string.
  • Some deduplication.

Working example for the bitwise operation, using bit varying. The mask is X'00FF00' and the it returns only the row X'AAAAAA'. I shortened the strings for the example but it would be over their full length, up to 4kB. Is it possible to do something similar with bytea?

CREATE TABLE test1 (mystring bit varying);
INSERT INTO test1 VALUES (X'AAAAAA'), (X'ABCABC');
SELECT * FROM test1 WHERE mystring & X'00FF00' = X'00AA00';

Which of bytea and bit varying is the more appropriate?

I saw the UUID type is made to store exactly 16 bytes, would that be any advantage to store the MD5's?

Victor
  • 1,680
  • 3
  • 22
  • 40
  • I'm not putting this as an answer since I'm no expert here, but in reading about this, it appears that bytea is the more appropriate choice, specifically using it's HEX format. In PostgreSQL's own words "The hex format is compatible with a wide range of external applications and protocols, and it tends to be faster to convert than the escape format, so its use is preferred. http://www.postgresql.org/docs/9.1/static/datatype-binary.html (specifically section 8.4.1) and a swell answer here: http://stackoverflow.com/questions/15982737/postgresql-data-type-for-md5-message-digest – JNevill Oct 29 '14 at 16:33
  • I understood the `hex` and `escape` are [external formats for input and output](http://www.postgresql.org/docs/9.3/static/datatype-binary.html#DATATYPE-BINARY-TABLE) only. The internal values are binary so it wouldn't change anything in my case unless I'm importing or exporting a massive number of binary values. – Victor Oct 29 '14 at 19:35
  • Well.. internally it's all binary anyway. It seems like it comes down to the support in your RDBMS and whatever tools you use to interact with it, which is what pushed me to `bytea`. Theoretically a Char(16) and bit(128) take the same number of bytes to store, so it's sort of a wash from the internal perspective (ignoring that there are octets that CHAR() can't store). I'm surprised non of the stackoverflow.com super geniuses have answered this one yet. It's a good question. – JNevill Oct 29 '14 at 19:52
  • @JNevill `char(16)` is variable in storage length as it is represented internally as UTF-8 and some unicode characters take more thn 1 byte. `char` in postgres is not the same as `char` in C. `bytea` is a closer match to C's `char[]` – Jasen Jun 28 '17 at 03:59
  • UUID has the advantage that it uses 'plain; storage which is slightly more efficient than externded storage which bytea and bit and bit() use. and the disadvantage that you're not suing it for UUIDs. – Jasen Jun 28 '17 at 04:04

2 Answers2

11

In general, if you're not using bitwise operations you should be using bytea.

I store larger values in bytea and then convert substrings to bit varying for bitwise operations where possible, mostly because clients understand bytea much more consistently than bit varying and the I/O format is more compact.

MD5 values should be stored as bytea. Bitwise operations on them make no sense, and you generally want to fetch them as binary.

I think bit varying really has two uses:

  • To store flags fields that are literally bit strings; and
  • As an interim data type for internal calculations

For pretty much everything else, use bytea.

There's nothing stopping you storing a 4k bitfield if that's what it is, though.

Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
  • To what degree is your recommendation based on the lower overhead for binary strings (1 or 4 bytes) in comparison to bit strings (5 or 8 bytes)? – David J. Sep 17 '16 at 02:14
  • @DavidJames Mostly that it's logically cleaner, and there's no good reason to use bit strings here. Use bit strings if bitwise operations are needed, otherwise don't. – Craig Ringer Sep 19 '16 at 05:15
  • Hi, about use with short values (ranging from 32 to 1024 bits) as *primary key*? It is better `bytea` or `bit varying`? (and perhaps fixed length `bit` also considered). – Peter Krauss May 08 '19 at 10:15
  • @PeterKrauss If you mean short-integer, then yes, since it's not a varlena type it'll perform better in general. – Craig Ringer May 09 '19 at 10:07
9
  1. It appears the maximum length of bytea is 1 GB. [1]
  2. For bitwise operation use bit varying (explanation see below)
  3. For storing MD5 hash use bytea. It will take less storage than bit varying
  4. The benefit using UUID is UUID algorithm somehow guarantees your uniqueness, not only in your table, but also in your database or even across your database (even if you generate UUID in your application). I think if you are using UUID without dashes it will be more efficient for storing, comparing and sorting in UUID (comparison between bytea and UUID see below).

For bitwise operation use bit varying

If you concern about storage: bit varying takes more storage than bytea. If you are okay then you should try comparing the function they both offer:

bit varying vs bytea

So far I can see bit varying will be more suitable for you to do bitwise operation though bytea is generally accepted way to store arbitrary data.

PostgreSQL offers a single bytea operator: concatenation. You can append one byte value to another bytea value using the concatenation operator ||. [1]

Note that you cannot compare two bytea value, even for equality/inequality. You can, of course, convert bytea value into another value using the CAST(), and that opens up other operators. [1]

Comparison between UUID and bytea

  create table u(uuid uuid primary key, payload character(300));
  create table b( bytea bytea primary key, payload character(300));

  INSERT INTO u                                                  
  SELECT uuid_generate_v4()                                                     
  FROM generate_series(1,1000*1000);

  INSERT INTO b                                                   
  SELECT random_bytea(16)                                                       
  FROM generate_series(1,1000*1000);

  VACUUM ANALYZE u;
  VACUUM ANALYZE b;

  ## Your table size
  SELECT pg_size_pretty(pg_total_relation_size('u'));
  pg_size_pretty 
  ---------------- 
  81 MB

  SELECT pg_size_pretty(pg_total_relation_size('b'));
  pg_size_pretty 
  ---------------- 
  101 MB

  ## Speed comparison
  \timing on

  ## Common select
  select * from u limit 1000;
  Time: 1.433 ms

  select * from b limit 1000;
  Time: 1.396 ms

  ## Random Select
  SELECT * FROM u OFFSET random()*1000 LIMIT 10000;
  Time: 42.453 ms

  SELECT * FROM b OFFSET random()*1000 LIMIT 10000;
  Time: 10.962 ms

Conclusion : I don't think there will be more benefit using UUID except its uniqueness and smaller size (will be faster to insert)

Note: No Index, there is only one connection

Some source :

  1. PostgreSQL: "The Comprehensive Guide to Building, Programming, And Administratoring PostgreSQL Databases" Book
Victor
  • 1,680
  • 3
  • 22
  • 40
Bagus Trihatmaja
  • 805
  • 1
  • 9
  • 26
  • "Note that you cannot compare two bytea value, even for equality/inequality" -- is this true? I seem to be able to via something like `WHERE x > decode('013d7d16d7ad4fefb61bd95b765c8ceb', 'hex')`. edit: or even `SELECT * FROM test WHERE x > y` where x and y are bytea columns – ZachM Jul 02 '18 at 15:53