-2

I would like to know whether multiple similar exists conditions can be combined in a meaningful and performant way.

Let us assume the following example: Different activities can be assigned to a service (n-m). Activities can be grouped independently into activity groups. Activity groups can be assigned to a group type.

If I now want to find all services that have a reference to certain group types and I want to link the condition by OR, then this is relatively simple by combining EXISTS and IN.

select *
from service
where exists (
          select 1
          from activitiy
               join activitiy_activitiy_group
                    on activitiy.id = activitiy_activitiy_group.id_activitiy
                    join activitiy_group
                        on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
          where (
                        activitiy_group.id_type in (1, 3)
                        and activitiy.id_service = service.id
                    );

If, on the other hand, I want to link the condition by AND, then it is not quite so simple. I could add multiple EXITS conditions:

select *
from service
where exists (
          select 1
          from activitiy
               join activitiy_activitiy_group
                    on activitiy.id = activitiy_activitiy_group.id_activitiy
                    join activitiy_group
                        on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
          where (
                        activitiy_group.id_type = 1
                        and activitiy.id_service = service.id
                    )
and
exists (
          select 1
          from activitiy
               join activitiy_activitiy_group
                    on activitiy.id = activitiy_activitiy_group.id_activitiy
                    join activitiy_group
                        on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
          where (
                            activitiy_group.id_type = 3
                        and activitiy.id_service = service.id
                    );

But I wonder if this approach is performant for many filter elements. I experimented a bit and one approach would be with only one subselect, by selecting all distinct activity group type ids related to a service into one array and comparing it with the filter values:

select *
from service
where true =
      (select ARRAY_AGG(activitiy_group.id_type) @> ('{1,3}'::Integer[])
       from activitiy
               join activitiy_activitiy_group
                    on activitiy.id = activitiy_activitiy_group.id_activitiy
                    join activitiy_group
                        on activitiy_activitiy_group.id_activitiy_group = activitiy_group.id
       where ativitiy.id_service = service.id);

But here, too, the question arises as to whether this is really performant. Can anyone assess this, or is there perhaps a more sensible alternative approach? I assumed that the underlying basic problem is a standard problem, but unfortunately could not find any other approach on the internet.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Philipp Li
  • 499
  • 6
  • 22
  • Basically, the `AND` case burns down to a [tag:relational-division] problem. See: https://stackoverflow.com/a/7774879/939860 *"... whether this is really performant"* - The best solution always depends on the actual, relevant environment. We need the range of possible input, what's rare and what's common, cardinalities, table definitions, Postgres version, ... Follow instructions given here: https://stackoverflow.com/tags/postgresql-performance/info – Erwin Brandstetter Aug 31 '23 at 11:21
  • Also, are involved tables mostly read-only or written a lot? – Erwin Brandstetter Aug 31 '23 at 11:28
  • Yes the involved tables are mostly read-only. I will try out your suggested approach. Thank you very much. – Philipp Li Sep 01 '23 at 11:11

1 Answers1

1

I'll pick a setting where the involved tables are big and writes are comparatively rare. Then it makes sense to create an auxiliary MATERIALIZED VIEW (once) to dramatically speed up the query:

CREATE MATERIALIZED VIEW service_activity_types AS
SELECT a.id_service, array_agg(ag.id_type) AS activity_types
FROM  (
   SELECT DISTINCT a.id_service, ag.id_type
   FROM   activitiy                 a
   JOIN   activitiy_activitiy_group aag ON aag.id_activitiy = a.id
   JOIN   activitiy_group           ag  ON ag.id = aag.id_activitiy_group
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1;

Generates a table with unique services with an array of unique activity types.

(It's typically faster to apply DISTINCT and ORDER BY once in a subquery.)

Create a unique index on service_activity_types to allow refreshing the MV CONCURRENTLY:

CREATE UNIQUE INDEX service_activity_type_uni ON service_activity_types (id_service);

Refresh after impactful changes to underlying tables:

REFRESH MATERIALIZED VIEW CONCURRENTLY service_activity_types;

Create a index on the array column to make the query fast. There are various options. For your case, I expect a GIN index using the operator class gin__int_ops from the additional module intarray to be fastest. Install the module once per database first. See:

CREATE INDEX service_activity_type_gin_idx ON service_activity_types USING gin (activity_types gin__int_ops);

Maybe even a multicolumn index. See:

Also, to get your column statistics started:

ANALYZE service_activity_types;

Then your query can be:

SELECT id_service
FROM   service_activity_types
WHERE  activity_types @> '{1,3}';

And it will be blazingly fast.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228