12

I've started working on a project where there is a fairly large table (about 82,000,000 rows) that I think is very bloated. One of the fields is defined as:

consistency character varying NOT NULL DEFAULT 'Y'::character varying

It's used as a boolean, the values should always either be ('Y'|'N').

Note: there is no check constraint, etc.

I'm trying to come up with reasons to justify changing this field. Here is what I have:

  • It's being used as a boolean, so make it that. Explicit is better than implicit.
  • It will protect against coding errors because right now there anything that can be converted to text will go blindly in there.

Here are my question(s).

  • What about size/storage? The db is UTF-8. So, I think there really isn't much of a savings in that regard. It should be 1 byte for a boolean, but also 1 byte for a 'Y' in UTF-8 (at least that's what I get when I check the length in Python). Is there any other storage overhead here that would be saved?
  • Query performance? Will Postgres get any performance gains for a where cause of "=TRUE" vs. "='Y'"?
David S
  • 12,967
  • 12
  • 55
  • 93

2 Answers2

24

PostgreSQL (unlike Oracle) has a fully-fledged boolean type. Generally, a "yes/no flag" should be boolean. That's the appropriate type!

What about size/storage?

A boolean column occupies 1 byte on disk.
(The manual) about text or character varying:

the storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string

That's at least 2 bytes for a single character.

Actual storage is more complicated than that. There is some fixed overhead per table, page and row, there is special NULL storage and some types require data alignment. See:

Encoding UTF8 doesn't make any difference here. Basic ASCII-characters are bit-compatible with other encodings like LATIN-1.

In your case, according to your description, you should keep the NOT NULL constraint you already have - independent of the data type.

Query performance?

Will be slightly better in any case with boolean. Besides being smaller, the logic for boolean is simpler and varchar or text are also generally burdened with COLLATION rules. But don't expect much for something that simple.

Instead of:

WHERE consistency = 'Y'

You could write:

WHERE consistency = true

But rather simplify to just:

WHERE consistency

No further evaluation needed.

Change type

Transforming your table is simple:

ALTER TABLE tbl ALTER consistency TYPE boolean
USING CASE consistency WHEN 'Y' THEN true ELSE false END;

This CASE expression folds everything that is not TRUE ('Y') to FALSE. The NOT NULL constraint just stays.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 3
    @DavidS Note that changing the type via `ALTER TABLE` will exclusively lock the whole table while it's rewritten. If you can't afford that you can do it in steps: `ALTER TABLE` to add a new nullable boolean field; `UPDATE` to populate that field; `ALTER TABLE` to make it `NOT NULL`, `ALTER TABLE` to drop the old varchar column and `ALTER TABLE` to rename the new boolean column to the varchar column's name. You can add a trigger that traps new `INSERT`s, `UPDATE`s and `DELETE`s while the rewrite is in progress and makes sure they set the new boolean col too. – Craig Ringer Oct 11 '12 at 01:32
3

Neither storage size nor query performance will be significantly better switching from a single VARCHAR to a BOOLEAN. Although you are right that it's technically cleaner to use boolean when you are talking about a binary value the cost to change is probably significantly higher than the benefit. If you're worried about correctness then you could put a check on the column, for example

ALTER TABLE tablename ADD CONSTRAINT consistency CHECK (consistency IN ('Y', 'N'));