Among the many things we do with Postgres at work, we use it as a cache for certain kinds of remote requests. Our schema is:
CREATE TABLE IF NOT EXISTS cache (
key VARCHAR(256) PRIMARY KEY,
value TEXT NOT NULL,
ttl TIMESTAMP DEFAULT NULL
);
CREATE INDEX IF NOT EXISTS idx_cache_ttl ON cache(ttl);
This table does not have triggers or foreign keys. Updates are typically:
INSERT INTO cache (key, value, ttl)
VALUES ('Ethan is testing8393645', '"hi6286166"', sec2ttl(300))
ON CONFLICT (key) DO UPDATE
SET value = '"hi6286166"', ttl = sec2ttl(300);
(Where sec2ttl
is defined as:)
CREATE OR REPLACE FUNCTION sec2ttl(seconds FLOAT)
RETURNS TIMESTAMP AS $$
BEGIN
IF seconds IS NULL THEN
RETURN NULL;
END IF;
RETURN now() + (seconds || ' SECOND')::INTERVAL;
END;
$$ LANGUAGE plpgsql;
Querying the cache is done in a transaction like this:
BEGIN;
DELETE FROM cache WHERE ttl IS NOT NULL AND now() > ttl;
SELECT value FROM cache WHERE key = 'Ethan is testing6460437';
COMMIT;
There are a few things not to like about this design -- the DELETE
happening in cache "reads", the index on (edit: ASC is the default, thanks wargre!) plus the fact that we're using Postgres as a cache at all. But all of that would have been acceptable except that we've started getting deadlocks in production, which tend to look like this:cache.ttl
is not ascending which makes it kind of useless,
ERROR: deadlock detected
DETAIL: Process 12750 waits for ShareLock on transaction 632693475; blocked by process 10080.
Process 10080 waits for ShareLock on transaction 632693479; blocked by process 12750.
HINT: See server log for query details.
CONTEXT: while deleting tuple (426,1) in relation "cache"
[SQL: 'DELETE FROM cache WHERE ttl IS NOT NULL AND now() > ttl;']
Investigating the logs more thoroughly indicates that both transactions were performing this DELETE
operation.
As far as I can tell:
- My transactions are in
READ COMMITTED
isolation mode. - ShareLocks are grabbed by one transaction to indicate that it wants to mutate rows that another transaction has mutated (i.e. locked).
- Based on the output of an
EXPLAIN
query, the ShareLocks should be grabbed by bothDELETE
transactions in physical order. - The deadlock indicates that both queries locked rows in a different order.
If all that is correct, then somehow some simultaneous transaction has changed the physical order of rows. I see that an UPDATE
can move a row to an earlier or later physical position, but in my application, the UPDATE
s always remove rows from consideration by the DELETE
s (because they're always extending a row's TTL). If the rows were previously in physical order, and you remove one, then you're still left with physical order. Similarly for DELETE
. We're not doing any VACUUM
or any other operation which you might expect to reorder rows.
Based on Avoiding PostgreSQL deadlocks when performing bulk update and delete operations, I tried to change the DELETE
queries to:
DELETE FROM cache c
USING (
SELECT key
FROM cache
WHERE ttl IS NOT NULL AND now() > ttl
ORDER BY ttl ASC
FOR UPDATE
) del
WHERE del.key = c.key;
However, I'm still able to get deadlocks locally. So generally, how can two DELETE
queries deadlock? Is it because they're locking in an undefined order, and if so, how do I enforce a specific order?