BACKGROUND/CONTEXT
If I define a sequence and a table as follows:
CREATE SEQUENCE public.test_sequence
    CYCLE
    MINVALUE 1
    MAXVALUE 2147483647;

CREATE TABLE public.test_table
(
    id integer NOT NULL DEFAULT nextval('test_sequence'::regclass),
    name text,
    CONSTRAINT test_table_pkey PRIMARY KEY (id)
);
And then insert a ton of values, I will eventually reach the end of the sequence. Because of the CYCLE keyword, nextval will wrap back to the beginning of the sequence and once again return 1, 2, etc. Unfortunately, if the old record with the same id still exists when I insert, I end up with a collision and the INSERT fails with a "duplicate key value violates unique constraint" error. I was somewhat surprised by this result: I had made the (obviously incorrect) assumption that nextval was slightly more intelligent and meant the next free id value, which would obviously be more useful in this context.
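The collision is easy to reproduce by hand without inserting billions of rows; a minimal sketch, assuming the sequence and table defined above start out empty, uses setval to jump to the end of the sequence:

```sql
INSERT INTO test_table (name) VALUES ('first');    -- nextval returns 1
SELECT setval('test_sequence', 2147483647);        -- pretend the sequence is exhausted
INSERT INTO test_table (name) VALUES ('wrapped');  -- CYCLE restarts at 1
-- ERROR: duplicate key value violates unique constraint "test_table_pkey"
```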
According to the following thread, it appears that I am perhaps not alone in making this assumption and that others have run into this same issue: How do I avoid nextval() colliding with existing entries on wrapping?
There are obviously a few possible workarounds to this problem:
1. Migrate to a bigger data type such as bigint.
2. Migrate to using UUID values instead.
3. Introduce an ON CONFLICT DO NOTHING clause to the INSERTs.
4. Introduce an ON CONFLICT DO UPDATE clause to the INSERTs.
5. Add a BEFORE INSERT trigger that advances the sequence with nextval until it reaches a free id.
6. Etc.?
For those who are interested, the following article explores the differences between a few of the above options: Choosing a Postgres Primary Key
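As an illustration only (we did not go this route), option 3 would look roughly like the following. Note that it silently drops the colliding row rather than retrying it with a fresh id:

```sql
-- Option 3: skip rows whose generated id is already taken.
INSERT INTO test_table (name)
VALUES ('example')
ON CONFLICT (id) DO NOTHING;
```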
For a variety of reasons, and for better or worse, I chose option 5 (the BEFORE INSERT trigger) to solve this issue. I would appreciate it if we could avoid debating which is the better solution here, especially since the best choice can depend on a person's particular requirements. I would, however, be interested if there are other unique workarounds that I did not consider above?
To further clarify our requirements: we are running on an embedded device and space is finite. We recently ran out of space, and we have barely started populating our major tables. Consequently, any option that increases our space usage is a poor fit for us. We do not currently have any hard time limits, so we can afford a performance penalty here. For our needs we have therefore chosen option 5; for other people's requirements, any of the other options above may indeed be a better choice.
QUESTION
Considering that we have already chosen option 5, what is the best way to implement it?
I am interested in getting feedback on the database triggers I wrote to handle this situation. Specifically:
- Does anybody notice a concurrency problem with the locking solution I went with?
- Are there better ways to write this trigger?
- What are some of the gotchas/footguns of this approach?
- Etc.
A1 - @Ry- makes a good point that key reuse can lead to race conditions (if I'm paraphrasing correctly here). One example scenario involves three transactions: the first does a DELETE on id 5, the second later does an INSERT that reuses id 5, and the third tries to do an UPDATE on id 5. Depending on when the third transaction arrives (before or after the other two), it will update either the old record or the new one, one of which might not be the intended update. Hence the race and the possibility of data corruption. I suppose this particular case could be mitigated by always doing a SELECT ... FOR UPDATE to check the previous state before doing the UPDATE (within the same transaction). However, this would obviously incur further performance overhead and risks users forgetting to do so.
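The mitigation described above might look roughly like this sketch (the id, column values, and the verification step are purely illustrative; the re-check itself has to happen in application code):

```sql
BEGIN;
-- Lock the row and re-read it so no other transaction can swap it out
-- from under us between the check and the update.
SELECT name FROM test_table WHERE id = 5 FOR UPDATE;
-- (the application now verifies the returned name matches what it read
--  earlier; if it does not, the id has been reused and we should abort)
UPDATE test_table SET name = 'new value' WHERE id = 5;
COMMIT;
```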
A2 - @Atmo makes a good point that there will be a performance penalty for this function depending on how full (or sparse) the id values are after you wrap around. This culminates in a worst-case scenario when no free ids are left: the trigger would deadlock if you are locking, or loop forever (at least until someone else deletes a record) if you are not. It should be noted that the time penalty is linear, as the loop only needs to check each candidate id once, mostly via the index rather than the full records. Mitigation techniques for this scenario include regular cleanup (if possible) to keep fewer entries in the table, and checking for a full wrap-around in the trigger to avoid the deadlock/looping.
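One way to implement that wrap-around check is to remember the first candidate id and fail loudly if the loop ever comes back to it. A sketch only, not wired into the trigger below; first_candidate_id is a hypothetical variable you would capture before entering the loop:

```sql
-- Inside the retry loop: if the sequence has cycled all the way back to
-- the first id we tried, every id is taken, so raise an error instead of
-- looping (or deadlocking) forever.
IF NEW.id = first_candidate_id THEN
    RAISE EXCEPTION 'no free id values left in %.%',
                    TG_TABLE_SCHEMA, TG_TABLE_NAME;
END IF;
```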
A3 - Cycling/wrapping the id values means that you can no longer rely on the id values for chronological ordering/sorting (if you happen to be doing so).
And here is the associated trigger code:
CREATE OR REPLACE FUNCTION fix_id_collision()
RETURNS TRIGGER AS $$
DECLARE
    found_id integer;
    id_default text;
BEGIN
    -- Look up the default expression (the nextval(...) call) for the id
    -- column once, before the loop. Note the column_name filter: without
    -- it the query can return the default of some other column.
    SELECT column_default INTO id_default
    FROM information_schema.columns
    WHERE table_schema = TG_TABLE_SCHEMA
      AND table_name = TG_TABLE_NAME
      AND column_name = 'id';

    -- Loop until we find a free id value.
    LOOP
        -- Check if the id already exists in the table.
        -- Use row level "FOR UPDATE" locking to hopefully ensure
        -- that concurrent INSERT queries don't receive the same id
        -- and collide.
        EXECUTE format('SELECT id FROM %I.%I WHERE id = $1 FOR UPDATE',
                       TG_TABLE_SCHEMA, TG_TABLE_NAME)
        INTO found_id
        USING NEW.id;
        IF found_id IS NULL THEN
            RETURN NEW;
        END IF;
        -- The id is taken; draw the next value from the sequence.
        EXECUTE format('SELECT %s', id_default) INTO NEW.id;
    END LOOP;
END;
$$
LANGUAGE plpgsql;
And some code to install it for all the tables in my database:
CREATE OR REPLACE FUNCTION install_id_collision_triggers()
RETURNS VOID
AS $$
DECLARE
    tbl_schema text;
    tbl_name text;
BEGIN
    -- Restrict to ordinary tables (a BEFORE INSERT ... FOR EACH ROW
    -- trigger cannot be created on a view) and skip the system schemas.
    FOR tbl_schema, tbl_name IN
        SELECT c.table_schema, c.table_name
        FROM information_schema.columns c
        JOIN information_schema.tables t
          ON t.table_schema = c.table_schema AND t.table_name = c.table_name
        WHERE c.column_name = 'id'
          AND t.table_type = 'BASE TABLE'
          AND c.table_schema NOT IN ('pg_catalog', 'information_schema')
    LOOP
        EXECUTE format('DROP TRIGGER IF EXISTS id_collision_trigger_%s ON %I.%I',
                       tbl_name, tbl_schema, tbl_name);
        EXECUTE format('CREATE TRIGGER id_collision_trigger_%s BEFORE INSERT ON %I.%I FOR EACH ROW EXECUTE FUNCTION fix_id_collision()',
                       tbl_name, tbl_schema, tbl_name);
    END LOOP;
END;
$$
LANGUAGE plpgsql;
SELECT install_id_collision_triggers();
P.S. I am already aware that I can do a "CREATE OR REPLACE TRIGGER" in recent PostgreSQL versions but I happen to be using an older version for testing so please excuse the extra step.
FOLLOW UP
@klin makes an interesting point here that there might not be a space difference between the integer and bigint/bigserial data types (on 64-bit systems) due to data type alignment, as described in his post: Benchmark: bigint vs int on PostgreSQL. I ran a very basic test and can confirm that there is only a marginal increase in space usage between integer and bigint; it worked out to approximately 4% more space for bigint in my case. Consequently, the space savings were not as great as I originally thought in choosing this particular approach, and I can now recommend against using this workaround, especially considering the potential for data corruption that it introduces.
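For anyone wanting to reproduce the comparison, something along these lines can be used (a rough sketch; the table names and row count are arbitrary, and exact numbers will depend on your row layout):

```sql
-- Two otherwise identical tables differing only in the id type.
CREATE TABLE size_test_int    (id integer NOT NULL, name text);
CREATE TABLE size_test_bigint (id bigint  NOT NULL, name text);

INSERT INTO size_test_int    SELECT g, 'row ' || g FROM generate_series(1, 100000) g;
INSERT INTO size_test_bigint SELECT g, 'row ' || g FROM generate_series(1, 100000) g;

-- Compare total on-disk size, including indexes and TOAST.
SELECT pg_total_relation_size('size_test_int')    AS int_bytes,
       pg_total_relation_size('size_test_bigint') AS bigint_bytes;
```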