Postgres query involving joins and count too slow

Question

This query below takes way too long to execute

Number of records in each table:

stories - 416,355 records
event_relations - 24,050,862 records
districts - 10 records
industries - 100 records
industry_districts - 40 records
subsectors - 200 records
supply - 2,869 records

select s.id, s.description, count(distinct et.id) from 
"district s" s 
inner join 
(
    select id.id from "industry_districts" id
) as id2 on id2.id = s.id
inner join 
(
    select id, industry_districts_id from "industries" i
) as i2 on i2.industry_districts_id = id2.id
inner join 
(
    select id, industries_id from "subsectors" si
) as si2 on si2.industries_id = i2.id
inner join 
(
    select id, subsectors_id from "supplys" rs 
) as rs2 on rs2.subsectors_id = si2.id
inner join 
(
    select id, supplys_id, events_id from "event_relations" ers 
) as ers2 on ers2.supplys_id = rs2.id
left join (
    select id, events_id from "stories"  
) et on et.events_id = ers2.events_id
group by s.id
limit 10 offset 0

**Explain analyze gives the following result **

Limit  (cost=0.99..20719301.15 rows=10 width=130) (actual time=2107.465..214891.582 rows=10 loops=1)
  ->  GroupAggregate  (cost=0.99..24863161.18 rows=12 width=130) (actual time=2107.462..214891.567 rows=10 loops=1)
        Group Key: s.id
        ->  Nested Loop Left Join  (cost=0.99..23466091.79 rows=279413855 width=126) (actual time=2.073..141429.178 rows=104217089 loops=1)
              ->  Nested Loop  (cost=0.57..607232.85 rows=11644034 width=126) (actual time=2.055..62770.206 rows=8174966 loops=1)
                    ->  Nested Loop  (cost=0.14..3598.24 rows=1389 width=126) (actual time=2.009..56.815 rows=927 loops=1)
                          Join Filter: (si.id = rs.subindustries_id)
                          Rows Removed by Join Filter: 226384
                          ->  Nested Loop  (cost=0.14..117.68 rows=77 width=126) (actual time=1.004..2.502 rows=80 loops=1)
                                Join Filter: (i.id = si.industries_id)
                                Rows Removed by Join Filter: 5309
                                ->  Nested Loop  (cost=0.14..32.60 rows=34 width=126) (actual time=0.513..0.775 rows=34 loops=1)
                                      Join Filter: (ig.id = i.industry_groups_id)
                                      Rows Removed by Join Filter: 671
                                      ->  Nested Loop  (cost=0.14..18.13 rows=12 width=126) (actual time=0.443..0.507 rows=11 loops=1)
                                            Join Filter: (s.id = ig.id)
                                            Rows Removed by Join Filter: 126
                                            ->  Index Scan using "PK_Sectors" on "Sectors" s  (cost=0.14..12.31 rows=12 width=122) (actual time=0.402..0.417 rows=11 loops=1)
                                            ->  Materialize  (cost=0.00..1.38 rows=25 width=4) (actual time=0.002..0.004 rows=12 loops=11)
                                                  ->  Seq Scan on "Industry_Groups" ig  (cost=0.00..1.25 rows=25 width=4) (actual time=0.011..0.019 rows=25 loops=1)
                                      ->  Materialize  (cost=0.00..2.05 rows=70 width=8) (actual time=0.001..0.013 rows=64 loops=11)
                                            ->  Seq Scan on "Industries" i  (cost=0.00..1.70 rows=70 width=8) (actual time=0.007..0.023 rows=70 loops=1)
                                ->  Materialize  (cost=0.00..4.38 rows=159 width=8) (actual time=0.010..0.028 rows=158 loops=34)
                                      ->  Seq Scan on "Subindustries" si  (cost=0.00..3.59 rows=159 width=8) (actual time=0.345..0.386 rows=159 loops=1)
                          ->  Materialize  (cost=0.00..174.03 rows=2869 width=8) (actual time=0.007..0.307 rows=2841 loops=80)
                                ->  Seq Scan on "supplys" rs  (cost=0.00..159.69 rows=2869 width=8) (actual time=0.578..4.098 rows=2869 loops=1)
                    ->  Index Scan using "IXFK_event_relations_supplys" on "event_relations" ers  (cost=0.44..350.07 rows=8451 width=8) (actual time=0.103..65.007 rows=8819 loops=927)
                          Index Cond: (supplys_id = rs.id)
              ->  Index Scan using "IXFK_Event_Tweets_Katalyst_Events" on "Event_Tweets"  (cost=0.42..1.40 rows=56 width=8) (actual time=0.003..0.006 rows=12 loops=8174966)
                    Index Cond: (katalyst_events_id = ers.katalyst_events_id)
Planning Time: 16.839 ms
Execution Time: 214891.868 ms

explain(analyze, verbose, buffers):

Limit  (cost=0.99..20719301.15 rows=10 width=130) (actual time=2751.614..222343.984 rows=10 loops=1)
  Output: s.id, s.description, (count(DISTINCT "stories".id))
  Buffers: shared hit=29275971 read=6562620, temp read=331820 written=332375
  ->  GroupAggregate  (cost=0.99..24863161.18 rows=12 width=130) (actual time=2751.601..222342.173 rows=10 loops=1)
        Output: s.id, s.description, count(DISTINCT "stories".id)
        Group Key: s.id
        Buffers: shared hit=29275971 read=6562620, temp read=331820 written=332375
        ->  Nested Loop Left Join  (cost=0.99..23466091.79 rows=279413855 width=126) (actual time=2.785..147622.650 rows=104217089 loops=1)
              Output: s.id, s.description, "stories".id
              Buffers: shared hit=29275971 read=6562620
              ->  Nested Loop  (cost=0.57..607232.85 rows=11644034 width=126) (actual time=2.755..65731.047 rows=8174966 loops=1)
                    Output: s.id, s.description, ers.events_id
                    Buffers: shared hit=2084415 read=6047838
                    ->  Nested Loop  (cost=0.14..3598.24 rows=1389 width=126) (actual time=2.013..60.815 rows=927 loops=1)
                          Output: s.id, s.description, rs.id
                          Join Filter: (si.id = rs.subdistricts_id)
                          Rows Removed by Join Filter: 226384
                          Buffers: shared hit=5 read=132
                          ->  Nested Loop  (cost=0.14..117.68 rows=77 width=126) (actual time=0.984..2.532 rows=80 loops=1)
                                Output: s.id, s.description, si.id
                                Join Filter: (i.id = si.industries_id)
                                Rows Removed by Join Filter: 5309
                                Buffers: shared hit=5 read=1
                                ->  Nested Loop  (cost=0.14..32.60 rows=34 width=126) (actual time=0.840..1.093 rows=34 loops=1)
                                      Output: s.id, s.description, i.id
                                      Join Filter: (ig.id = i.industry_districts_id)
                                      Rows Removed by Join Filter: 671
                                      Buffers: shared hit=3 read=1
                                      ->  Nested Loop  (cost=0.14..18.13 rows=12 width=126) (actual time=0.759..0.820 rows=11 loops=1)
                                            Output: s.id, s.description, ig.id
                                            Inner Unique: true
                                            Join Filter: (s.id = ig.id)
                                            Rows Removed by Join Filter: 126
                                            Buffers: shared hit=2 read=1
                                            ->  Index Scan using "PK_districts" on public."districts" s  (cost=0.14..12.31 rows=12 width=122) (actual time=0.379..0.392 rows=11 loops=1)
                                                  Output: s.id, s.description, s.active_flag, s.shared_district_id, s.additional_text, s.assets
                                                  Buffers: shared hit=1 read=1
                                            ->  Materialize  (cost=0.00..1.38 rows=25 width=4) (actual time=0.031..0.034 rows=12 loops=11)
                                                  Output: ig.id
                                                  Buffers: shared hit=1
                                                  ->  Seq Scan on public."industry_districts" ig  (cost=0.00..1.25 rows=25 width=4) (actual time=0.043..0.048 rows=25 loops=1)
                                                        Output: ig.id
                                                        Buffers: shared hit=1
                                      ->  Materialize  (cost=0.00..2.05 rows=70 width=8) (actual time=0.003..0.013 rows=64 loops=11)
                                            Output: i.industry_districts_id, i.id
                                            Buffers: shared hit=1
                                            ->  Seq Scan on public."industries" i  (cost=0.00..1.70 rows=70 width=8) (actual time=0.021..0.033 rows=70 loops=1)
                                                  Output: i.industry_districts_id, i.id
                                                  Buffers: shared hit=1
                                ->  Materialize  (cost=0.00..4.38 rows=159 width=8) (actual time=0.001..0.017 rows=158 loops=34)
                                      Output: si.industries_id, si.id
                                      Buffers: shared hit=2
                                      ->  Seq Scan on public."subdistricts" si  (cost=0.00..3.59 rows=159 width=8) (actual time=0.018..0.054 rows=159 loops=1)
                                            Output: si.industries_id, si.id
                                            Buffers: shared hit=2
                          ->  Materialize  (cost=0.00..174.03 rows=2869 width=8) (actual time=0.007..0.315 rows=2841 loops=80)
                                Output: rs.subdistricts_id, rs.id
                                Buffers: shared read=131
                                ->  Seq Scan on public."supplys" rs  (cost=0.00..159.69 rows=2869 width=8) (actual time=0.505..3.618 rows=2869 loops=1)
                                      Output: rs.subdistricts_id, rs.id
                                      Buffers: shared read=131
                    ->  Index Scan using "IXFK_event_relations_supplys" on public."event_relations" ers  (cost=0.44..350.07 rows=8451 width=8) (actual time=0.112..67.905 rows=8819 loops=927)
                          Output: ers.id, ers.events_id, ers.supplys_id, ers.additional_information
                          Index Cond: (ers.supplys_id = rs.id)
                          Buffers: shared hit=2084410 read=6047706
              ->  Index Scan using "IXFK_stories_events" on public."stories"  (cost=0.42..1.40 rows=56 width=8) (actual time=0.003..0.006 rows=12 loops=8174966)
                    Output: "stories".id, "stories".events_id,"stories".published_at
                    Index Cond: ("stories".events_id = ers.events_id)
                    Buffers: shared hit=27191556 read=514782
Planning:
  Buffers: shared hit=57 read=23
Planning Time: 30.368 ms
Execution Time: 222346.279 ms

Indexes created

CREATE INDEX "IXFK_event_relations_events" ON public."event_relations" USING btree (events_id);

CREATE INDEX "IXFK_event_relations_supplys" ON public."event_relations" USING btree (supplys_id);

CREATE UNIQUE INDEX "PK_event_relations" ON public."event_relations" USING btree (id);

CREATE INDEX "IXFK_stories_events" ON public."stories" USING btree (events_id);

CREATE UNIQUE INDEX "PK_stories" ON public."stories" USING btree (id);

CREATE INDEX "IXFK_industries_industry_districts" ON public."industries" USING btree (industry_districts_id);

CREATE UNIQUE INDEX "PK_industries" ON public."industries" USING btree (id);

CREATE INDEX "IXFK_industry_districts_district s" ON public."industry_districts" USING btree (district s_id);

CREATE UNIQUE INDEX "PK_industry_districts" ON public."industry_districts" USING btree (id);

CREATE UNIQUE INDEX "PK_district s" ON public."district s" USING btree (id);

CREATE INDEX "IXFK_subsectors_industries" ON public."subsectors" USING btree (industries_id);

CREATE UNIQUE INDEX "PK_subsectors" ON public."subsectors" USING btree (id);


CREATE TABLE public."events" (
    id serial4 NOT NULL,
    updated_at timestamptz NULL,
    created_at timestamptz NOT NULL,
    title text NOT NULL,
    CONSTRAINT "PK_events" PRIMARY KEY (id)
);

CREATE TABLE public."districts" (
    id serial4 NOT NULL,
    description varchar(50) NOT NULL,
    CONSTRAINT "PK_districts" PRIMARY KEY (id)
);

CREATE TABLE public."stories" (
    id serial4 NOT NULL,
    events_id int4 NOT NULL,
    created_at timestamptz NULL,
    CONSTRAINT "PK_stories" PRIMARY KEY (id),
    CONSTRAINT "FK_stories_events" FOREIGN KEY (events_id) REFERENCES public."events"(id)
);
CREATE INDEX "IXFK_stories_events" ON public."stories" USING btree (events_id);


CREATE TABLE public."industry_districts" (
    id serial4 NOT NULL,
    districts_id int4 NOT NULL,
    description varchar(50) NOT NULL,
    CONSTRAINT "PK_industry_districts" PRIMARY KEY (id),
    CONSTRAINT "FK_industry_districts_districts" FOREIGN KEY (districts_id) REFERENCES public."districts"(id)
);
CREATE INDEX "IXFK_industry_districts_districts" ON public."industry_districts" USING btree (districts_id);


CREATE TABLE public."industries" (
    id serial4 NOT NULL,
    industry_districts_id int4 NOT NULL,
    description varchar(50) NOT NULL,
    CONSTRAINT "PK_industries" PRIMARY KEY (id),
    CONSTRAINT "FK_industries_industry_districts" FOREIGN KEY (industry_districts_id) REFERENCES public."industry_districts"(id)
);
CREATE INDEX "IXFK_industries_industry_districts" ON public."industries" USING btree (industry_districts_id);


CREATE TABLE public."subsector" (
    id serial4 NOT NULL,
    industries_id int4 NOT NULL,
    description varchar(50) NOT NULL,
    CONSTRAINT "PK_subsector" PRIMARY KEY (id),
    CONSTRAINT "FK_subsector_industries" FOREIGN KEY (industries_id) REFERENCES public."industries"(id)
);
CREATE INDEX "IXFK_subsector_industries" ON public."subsector" USING btree (industries_id);


CREATE TABLE public."supplys" (
    id serial4 NOT NULL,
    subsector_id int4 NOT NULL,
    description text NOT NULL,
    CONSTRAINT "PK_supplys" PRIMARY KEY (id),
    CONSTRAINT "FK_supplys_subsector" FOREIGN KEY (subsector_id) REFERENCES public."subsector"(id)
);
CREATE INDEX "IXFK_supplys_subsector" ON public."supplys" USING btree (subsector_id);


CREATE TABLE public."event_relations" (
    id serial4 NOT NULL,
    events_id int4 NOT NULL,
    supplys_id int4 NOT NULL,
    CONSTRAINT "PK_event_relations" PRIMARY KEY (id),
    CONSTRAINT "FK_event_relations_events" FOREIGN KEY (events_id) REFERENCES public."events"(id),
    CONSTRAINT "FK_event_relations_supplys" FOREIGN KEY (supplys_id) REFERENCES public."supplys"(id)
);
CREATE INDEX "IXFK_event_relations_events" ON public."event_relations" USING btree (events_id);
CREATE INDEX "IXFK_event_relations_supplys" ON public."event_relations" USING btree (supplys_id);

I am using postgres version 14.0

How do I improve the performance of this query?

Could you please share the results from explain(analyze, verbose, buffers) for this statement, and share all DDL for all tables, views and indexes involved? As an update to your question, in plain text. And what are your expectations for the performance? The database is counting 104217089 records, that will always take a wile. — Frank Heikens, Jul 09 '23 at 12:03
I have updated my question with the details @Frank. The data is quite large but I was expecting to reduce it to probably a few seconds. It currently takes around two minutes. — Code, Jul 09 '23 at 12:47
Needed details are still missing. It appears that some of the relations are views instead of tables: `"Event_Related_Stocks"`, `"stories"`, and `"subdistricts"` are in the explain plan but not the query. — JohnH, Jul 09 '23 at 22:36

Lajos Arpad · Answer 1 · 2023-07-09T12:27:21.200

0

You could get rid of the subselects and allow PostgreSQL to use its internal optimizations while joining:

SELECT s.id, s.description, count(DISTINCT et.id)
  FROM "district s" s
  INNER JOIN "industry_districts" id2
    ON id2.id = s.id
  INNER JOIN "industries" i2
    ON i2.industry_districts_id = id2.id
  INNER JOIN "subsectors"  si2
    ON si2.industries_id = i2.id
  INNER JOIN "supplys" rs2
    ON rs2.subsectors_id = si2.id
  INNER JOIN "event_relations" ers2
    ON ers2.supplys_id = rs2.id
  LEFT JOIN "event_stories" et
    ON et.events_id = ers2.events_id
  GROUP BY s.id
  LIMIT 10 OFFSET 0;

I have no contextual knowledge about your query, so excuse me if I have made a typo.

EDIT

You could run a cron job that runs (and waits for the result) periodically based on the query above and store the results somewhere (cache, server files, table or materialized view, it's your choice).

Then, whenever the users are trying to access these values, they can reach out to the temporarily stored results rather than waiting for the results each time they access them.

edited Jul 09 '23 at 12:27

answered Jul 09 '23 at 11:44

Lajos Arpad

64,414
37
100
175

I tried it. It still takes the same number of seconds to execute – Code Jul 09 '23 at 11:54
@Code you can try adding some indexes on the foreign keys there. How many records do you have in these tables? Do you need realtime data or is it okay if they are 5 minutes old? – Lajos Arpad Jul 09 '23 at 11:58
There are indexes added on the foreign keys for all these tables @Lajos. It's okay even if the data is 5 minutes old – Code Jul 09 '23 at 12:22
@Code I have edited my answer with a further idea. If you want to improve the actual query, then you might want to share more information about your tables, like their number of records as well as their structure along with indexes. JohnH suggested the creation of a subselect for "district s" in his answer and that may or may not be a good idea, depending on your current situation. Please try that one as well, see whether it improves performance and if not, either implement a cron job or provide more info about your tables. – Lajos Arpad Jul 09 '23 at 12:30
I have updated my question with the details asked. And a subselect for "district s" does not seem to help – Code Jul 09 '23 at 13:16
@Code "district s" no table definition. Is the id a primary key? Do you have an index on supplys.subsectors_id? – Lajos Arpad Jul 09 '23 at 14:57
@Code If I were you, I would start by selecting from a single table without joining the others. If it's slow, then I have a problem with that table already. If not, then I would add the next table to the query and measure that selection as well, until I find the exact dimension that is causing the slowness. That would allow you to locate the exact join that causes the problem. You can restrict your list of columns to a subset of the columns you are to select while you conduct this experiment if you count distinct values from the last table. – Lajos Arpad Jul 09 '23 at 15:00
@Code I would leave the group by clause inside the query from the very start, to see whether the aggregation is the actual cause of the problem. – Lajos Arpad Jul 09 '23 at 15:01
Yes. The aggregation seems to be the actual problem. It takes a few milliseconds without the group by and count. And it works faster until event_relations join is added. Once event_relations join is added with aggregation, it takes 2 minutes – Code Jul 10 '23 at 04:59
@Code if you run a query selecting from the main table only (without the joins) and aggregate it, is it slow? – Lajos Arpad Jul 10 '23 at 07:47
Yes. This query is fast - select s.id, s.description, count(distinct rs2.id) from "district s" s inner join ( select id.id from "industry_districts" id ) as id2 on id2.id = s.id inner join ( select id, industry_districts_id from "industries" i ) as i2 on i2.industry_districts_id = id2.id inner join ( select id, industries_id from "subsectors" si ) as si2 on si2.industries_id = i2.id inner join ( select id, subsectors_id from "supplys" rs ) as rs2 on rs2.subsectors_id = si2.id group by s.id limit 10 offset 0 – Code Jul 10 '23 at 08:10
As long as I do not use event_relations and stories, it's fast – Code Jul 10 '23 at 08:11
@Code if you use event_relations only, is it getting slow already (with aggregation)? – Lajos Arpad Jul 10 '23 at 08:27
Yes @Lajos. I have decided to go forward with cron. Seems like there are too many records to lessen the number of seconds – Code Jul 10 '23 at 09:42
@Code that's a good idea. Out of curiosity, can you look at the actual id values of event_relations in the result set and replace the table's selection with the dummy (select 321 as supply_id, 123 as event_id union select 654 as supply_id, 456 as event_id union select 987 as supply_id, 789 as event_id) as ers2 on ers2.supplys_id = rs2.id and then join the last table? Is it fast if we replace the penultimate table with a dummy representation of only the results for the penultimate table? Would adding the last table be fast if the penultimate table would be fast? – Lajos Arpad Jul 10 '23 at 09:53
I will check that and get back to you @Lajos – Code Jul 10 '23 at 12:56

score 0 · Answer 2 · answered Jul 09 '23 at 12:24

The following equivalent query should run much faster:

SELECT s.id, s.description, count(DISTINCT et.id)
  FROM (SELECT id, description
          FROM "district s"
          GROUP BY id
          LIMIT 10 OFFSET 0) s
  INNER JOIN "industry_districts" id2
    ON id2.id = s.id
  INNER JOIN "industries" i2
    ON i2.industry_districts_id = id2.id
  INNER JOIN "subsectors"  si2
    ON si2.industries_id = i2.id
  INNER JOIN "supplys" rs2
    ON rs2.subsectors_id = si2.id
  INNER JOIN "event_relations" ers2
    ON ers2.supplys_id = rs2.id
  LEFT JOIN "event_stories" et
    ON et.events_id = ers2.events_id;

The returned rows are indeterminate in this query, and the one from the original post, because they lack an ORDER BY clause before the LIMIT and OFFSET.

This seems to increase the time taken to 3 minutes @JohnH – Code Jul 09 '23 at 13:17 — Code, Jul 09 '23 at 13:17

Postgres query involving joins and count too slow

2 Answers2