How to optimise anti-join SQL queries

Question

I have a database table where each entry is a resource that can be used once by each user. Each user can only use resources that they have not already used, but the same resource can be used by different users, so if I'm not mistaken, the users and resources are in a M:N relationship, this is why I have an association table between them. The schema looks like this:

CREATE TABLE users (
  id SERIAL NOT NULL,
  PRIMARY KEY(id)
)

CREATE TABLE resources (
  id VARCHAR NOT NULL,
  created_at TIMESTAMP,
  PRIMARY KEY(id)
)

CREATE TABLE resource_usage (
  resource_id VARCHAR NOT NULL,
  user_id SERIAL NOT NULL,
  created_at TIMESTAMP,
  PRIMARY KEY(resource_id, user_id),
  FOREIGN KEY(resource_id) REFERENCES resources (id),
  FOREIGN KEY(user_id) REFERENCES users (id)
)

I run the following SQL query (PostgreSQL 11.16) to select the first available resource for a specific user.

SELECT * FROM resources WHERE NOT EXISTS(
  SELECT 1 FROM resource_usage
  WHERE resources.id = resource_usage.resource_id
  AND resource_usage.user_id = 1
)
LIMIT 1

To add some context, there should never be a scenario where there is no available resource for a user. I keep metrics of available resources for each user and add more accordingly. The resources table contains a little over a million of rows, but it will keep growing. There is no explicit order in which the resources should be consumed.

As both the resources and resource_usage tables grow larger, the nested loop anti-join takes longer to complete (it takes longer to find the first available resource). My question is how do I write an index that would make this operation more efficient.

So far I have tried to create a sorting index where I sorted the resources by the time they were inserted. This made sense to me, because if the resource has only been added now, it's likely it wasn't used yet at all. This actually helped a little with the performance issue, but I don't add new resources very often and the performance starts to degrade quickly.

Is there any order to resources or used resources? Even an order that is not strictly enforced? Is it common that *all* resources are used already? Also: basic table definitions (`CREATE TABLE` statements), rough cardinalities for resources, users and used resources, and your version of Postgres, please. — Erwin Brandstetter, Mar 20 '23 at 13:37
Are you at liberty to enforce an order in which resources must be used? — Erwin Brandstetter, Mar 20 '23 at 13:45
I can enforce an order if necessary. I have already done that with the sorted index, it just didn't help as much as I expected. Also I have added context and schemas to my question. — Danosaurus, Mar 20 '23 at 13:59
If you can enforce an order then why don't you just store the id of the resource used last? Then you could select next resource upon minimum id being larger than this stored value – assuming you have strictly monotone ids – or if creation timestamps are unique, do the same with these. If you cannot get gaps within resources for deleting some of them you might even just increment the stored id by 1. — Aconcagua, Mar 20 '23 at 14:16
The resource ID is unfortunately not an auto-incremented integer value, it's a VARCHAR with no relation to other IDs. — Danosaurus, Mar 20 '23 at 14:19
Unlucky choice – I'd have preferred an integral id column with auto-increment and current id column renamed to `name` or similar with `UNIQUE` constraint. Is it yet feasible to update your table design that way? — Aconcagua, Mar 20 '23 at 14:25
This whole design is rather questionable. Why parcel out resources to start with, if you just monitor them and create them when you notice they are needed? Wouldn't it be better to just create them at the point of need, by not even giving them individual identifiers? — jjanes, Mar 20 '23 at 15:12

Erwin Brandstetter · Accepted Answer · 2023-03-20T15:00:09.003

Even with index support this kind of anti-join grows expensive with very big numbers of used resources.

Your example makes it seem arbitrary which resource to use next. So you might as well enforce a deterministic order in which resources have to be used. (And you commented that you can.)

You already mentioned an insertion timestamp for resources. While multiple resources can be entered at the same time, we might use something like (created_at, id) as deterministic sort order for resources. But created_at must be NOT NULL and values monotonically nondecreasing.

And we need an equally deterministic order for resource_usage. I see you already have a similar (distinct) column created_at. (created_at, id) can be the deterministic sort order per user_id for resource_usage.

Since new resources are always entered with a higher (created_at, id), we never miss a new resource, either.

Have these indexes:

CREATE INDEX foo ON resources (created_at, id);
CREATE INDEX bar ON resource_usage (user_id, created_at, id);

We also need the index that comes with the PK on resources(id).

Then this query is, and always will be, very fast:

SELECT id
FROM   resources r
WHERE  (created_at, id) > (
   SELECT COALESCE(
      (
      SELECT (r1.created_at, r1.id)
      FROM  (
         -- id of most recently used resource
         SELECT id
         FROM   resource_usage u
         WHERE  u.user_id = 1          -- user_id here!
         ORDER  BY created_at, id
         LIMIT  1
         ) u
      JOIN   resource r1 USING (id)
      )
    , ('-infinity', '')   -- matching data types!
      )
   )
ORDER  BY created_at, id
LIMIT  1;

Remember that resource_usage.created_at is distinct from resources.created_at. After sorting resource_usage, we need to grab the corresponding resources.created_at.

I threw in COALESCE to cover the case where a user hasn't used anything, yet, defaulting to a row that's guaranteed to be smaller than any existing row. Using ROW values for the purpose. About ROW value comparison:

SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'

To rule out race conditions from concurrent write access, take a write lock on the user row in the same transaction before running this query. Like:

BEGIN;

SELECT FROM users WHERE id = 1 FOR NO KEY UPDATE;

-- above query

-- actually claim the resource by writing to resource_usage

COMMIT;

You have to consolidate existing data before you can start using this modus operandi. Make sure there is no older, unused resource for the same user.

About your attempt

This actually helped a little with the performance issue, but I don't add new resources very often and the performance starts to degrade quickly.

That's to be expected. This approach finds unused resources quickly at first. But, over time, it degrades to a worst-case scenario as the search always starts at the same point and ensures a maximum of rows to be skipped before finding the next free one.

Asides

PostgreSQL 11.16 is getting old (EOL Nov 2023). Upgrade to a current version at your earliest opportunity.

timestamptz is typically a better choice than timestamp if multiple time zone can be involved in any way.

How to optimise anti-join SQL queries

1 Answers1

About your attempt

Asides