With a many-to-many relationship, search by the many for the one

Question

I have a many-to-many relationship between releases and artifacts, where a given release is associated with multiple artifacts, and a given artifact is associated with multiple releases.

I understand how to model this: I have a releases table with an ID column:

CREATE TABLE releases (
    release_uuid uuid PRIMARY KEY
);

and an artifacts table with an ID column:

CREATE TABLE artifacts (
    artifact_uuid uuid PRIMARY KEY,
    hash          bytea
    -- other data
);

and a joining table release_artifacts that has foreign key columns from each of the others:

CREATE TABLE release_artifacts (
    id            serial PRIMARY KEY,
    release_uuid  uuid REFERENCES releases(release_uuid) NOT NULL,
    artifact_uuid uuid REFERENCES artifacts(artifact_uuid) NOT NULL,
    UNIQUE (release_uuid, artifact_uuid)
);

What I want to do is find a release "containing" a given set of artifacts, so that I can warn about duplicate releases. That is, for artifacts A1, A2, and A3, what release(s) Rx is defined by exactly those three artifacts? More visually, given the release_artifacts table:

release ID | artifact ID
-----------+------------
R1         | A1
R1         | A2
R1         | A3
R2         | A4
R2         | A2
R2         | A3

what search can I perform with A1, A2, A3 as the input that would give me back R1? A search on A2, A3 would return NULL. Or do I need a different model? I assume this would be easier if the release_artifacts table mapped a release to an array of artifact IDs, but then I lose the referential integrity with the artifact table.

I don't need maximum performance or maximal concurrency protection, but I'd be happy if those things don't significantly increase the complexity of the query. This is in a Postgres 9.6 database, though I'd consider that a version floor.

score 3 · Accepted Answer · answered Jun 10 '19 at 22:14

3

You can use aggregation:

select release_id
from release_artifacts
group by release_id
having sum( artifact_id in ('A1', 'A2', 'A3') ) = 3 and
       count(*) = 3;

This assumes no duplicates.

Or you can use string or array aggregation:

select release_id
from release_artifacts
group by release_id
having string_agg(artifact_id order by artifact_id) = 'A1,A2,A3';

answered Jun 10 '19 at 22:14

Gordon Linoff

1,242,037
58
646
786

I got my query working with the second variant, though I used `array_agg()` instead, because I couldn't quickly figure out how to do the `ORDER BY` in the first argument of `string_agg()` as well as have the second (delimiter) arg. I also had to find a way to sort the input array, but I did manage that. I still plan on noodling around with @Erwin's answer, and I'll try out your first variant as well. – Danek Duvall Jun 11 '19 at 05:32

Erwin Brandstetter · Answer 2 · 2019-06-14T23:37:59.310

This is a case of relational-division. Here is an arsenal of basic techniques:

How to filter SQL results in a has-many-through relation

For your given (typical) many-to-many setup, this is among the fastest possible queries:

SELECT release_id
FROM   release_artifacts ra1
JOIN   release_artifacts ra2 USING (release_id)
JOIN   release_artifacts ra3 USING (release_id)
WHERE  ra1.artifact_id = 'A1' 
AND    ra2.artifact_id = 'A2' 
AND    ra3.artifact_id = 'A3';

The downside of this query: you have to adjust the build for the number of artifacts you are looking for. If it's always 3, there is no downside at all.

For a dynamic number of artifacts, you might build the query dynamically. Or use a recursive CTE as instructed here (recommended!):

Using same column multiple times in WHERE clause

It helps performance quite a bit to have the constraint (and its implementing index) on (artifact_id, release_id) and not the other way round on (release_id, artifact_id), as the first and (hopefully) most selective predicate is on artifact_id. It often pays to have an additional index on the reverse combination to cover all bases. See:

To additionally limit the search to releases with the exact given set of artifacts (and no additional ones) - like you commented:

SELECT release_id
FROM   release_artifacts ra1
JOIN   release_artifacts ra2 USING (release_uuid)
JOIN   release_artifacts ra3 USING (release_uuid)
WHERE  ra1.artifact_uuid = 'A1' 
AND    ra2.artifact_uuid = 'A2'
AND    ra2.artifact_uuid = 'A3'
AND    NOT EXISTS (      -- no other artifacts
   SELECT FROM release_artifacts rax
   WHERE  rax.release_uuid   = ra1.release_uuid
   AND    rax.artifact_uuid <> ra1.artifact_uuid
   AND    rax.artifact_uuid <> ra2.artifact_uuid
   AND    rax.artifact_uuid <> ra3.artifact_uuid
   );

Alternatively:

   ...
   AND    rax.artifact_uuid <> ALL ('{A1, A2, A3}'::uuid[])
   );

Or with LEFT JOIN / IS NULL. See:

Select rows which are not present in other table

Should only cost slightly more and scale in similar fashion.

I tried the four combinations: both orientations of the constraint index and both the array aggregation solution @Gordon provided and your set of joins (which I can probably use by dynamically constructing the query). I ran a test adding 10k releases of 5 artifacts each. As you predicted, the join version with the (artifact, release) index was the fastest, but it was also constant time across all additions. Reversing the index caused it to grow slightly (~10x). The array version had the reverse issue with the index, grew more rapidly (~40-80x), and ran into memory issues far sooner. — Danek Duvall, Jun 14 '19 at 19:08
Graphs of those four experiments: https://pasteboard.co/IjpO0N4.png https://pasteboard.co/IjpPaAF.png https://pasteboard.co/IjpPMpQ.png https://pasteboard.co/IjpQp1T.png. I'm assuming that when the graph gets extra noisy (the spikes of over 100ms) that it's because Postgres couldn't do the query entirely in memory, but that's just conjecture. — Danek Duvall, Jun 14 '19 at 19:11
@DanekDuvall: Very interesting feedback. Good to see it actually performs as advertised. Spikes can be due to a number of reasons. Competing Postgres activity, other processes, `VACUUM`, flushing cache, hardware issues, ... — Erwin Brandstetter, Jun 14 '19 at 23:09
I'll note that this style doesn't work exactly like the array aggregation above. If I create a release `R1` with `{A1,A2,A3}`, and then want to create a new release with `{A1,A2}`, this series of joins will tell me a release already exists with those artifacts. That's correct in that formulation of the question, but I only want it to tell me the release exists if it has exactly the incoming set. — Danek Duvall, Jun 14 '19 at 23:10
@DanekDuvall: Good point. Can be done with similar performance as well. Building the query dynamically, the addition is also trivial. Better yet: use a recursive CTE instead to cover any number of artifacts with a single query. See addition above. — Erwin Brandstetter, Jun 14 '19 at 23:39

With a many-to-many relationship, search by the many for the one

2 Answers2