I need to create a view with some calculated and aggregated values, so I need certain values multiple times, like total_dist_pts in the example below. There is a loc_a_run table with about 350 rows so far (constantly growing) and a loc_a_pline table with somewhat more than 4 million rows (also growing):
prod=# \d loc_a_run
                                Table "public.loc_a_run"
     Column     |           Type           | Collation | Nullable |           Default
----------------+--------------------------+-----------+----------+------------------------------
 loc_a_run_id   | integer                  |           | not null | generated always as identity
 execution_time | timestamp with time zone |           | not null | CURRENT_TIMESTAMP(0)
 run_type_id    | smallint                 |           | not null |
 ...
 has_errors     | boolean                  |           |          |
Indexes:
    "loc_a_run_pkey" PRIMARY KEY, btree (loc_a_run_id)
prod=# \d loc_a_pline
                           Table "public.loc_a_pline"
     Column     |   Type    | Collation | Nullable |           Default
----------------+-----------+-----------+----------+------------------------------
 loc_a_pline_id | bigint    |           | not null | generated always as identity
 loc_a_run_id   | integer   |           | not null |
 is_top         | boolean   |           | not null |
 ...
 dist_left      | numeric   |           | not null |
 dist_right     | numeric   |           | not null |
 ...
Indexes:
    "loc_a_pline_pkey" PRIMARY KEY, btree (loc_a_pline_id)
Foreign-key constraints:
    "loc_a_pline_loc_a_run_id_fkey" FOREIGN KEY (loc_a_run_id) REFERENCES loc_a_run(loc_a_run_id) ON UPDATE CASCADE ON DELETE CASCADE
The solution I use right now:
SELECT run.loc_a_run_id AS run_id
     , run_type.run_type
     , SUM(CASE
              WHEN pline.is_top IS true
              THEN ROUND(pline.dist_right - pline.dist_left, 2)
              ELSE ROUND(pline.dist_left - pline.dist_right, 2)
           END) AS total_dist_pts
     , COUNT(pline.loc_a_pline_id) AS total_plines
     , SUM(CASE
              WHEN pline.is_top IS true
              THEN ROUND(pline.dist_right - pline.dist_left, 2)
              ELSE ROUND(pline.dist_left - pline.dist_right, 2)
           END)
       / COUNT(pline.loc_a_pline_id) AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN loc_a_pline AS pline USING (loc_a_run_id)
JOIN loc_a_run_type AS run_type USING (run_type_id)
WHERE run.has_errors IS false
GROUP BY run_id, run_type;
Query plan:
"Finalize GroupAggregate (cost=154201.17..154577.71 rows=1365 width=108)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Gather Merge (cost=154201.17..154519.69 rows=2730 width=76)"
" Workers Planned: 2"
" -> Sort (cost=153201.15..153204.56 rows=1365 width=76)"
" Sort Key: run.loc_a_run_id, run_type.run_type"
" -> Partial HashAggregate (cost=153113.01..153130.07 rows=1365 width=76)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Hash Join (cost=21.67..120633.75 rows=1623963 width=62)"
" Hash Cond: (run.run_type_id = run_type.run_type_id)"
" -> Hash Join (cost=20.55..112756.41 rows=1623963 width=32)"
" Hash Cond: (pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline pline (cost=0.00..107766.55 rows=1867855 width=30)"
" -> Hash (cost=17.14..17.14 rows=273 width=6)"
" -> Seq Scan on loc_a_run run (cost=0.00..17.14 rows=273 width=6)"
" Filter: (has_errors IS FALSE)"
" -> Hash (cost=1.05..1.05 rows=5 width=34)"
" -> Seq Scan on loc_a_run_type run_type (cost=0.00..1.05 rows=5 width=34)"
This takes around 14.2 s to execute. I lack the experience to judge whether that is good or bad for this kind of query, but I could live with it. Of course, faster would be better.
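A note on how I measure: the plans I paste here are plain EXPLAIN output, so they show estimates only; the 14.2 s is measured wall-clock time. When comparing variants I also run a stripped-down probe with actual timings, something like this (ANALYZE executes the query, BUFFERS adds I/O counters):

EXPLAIN (ANALYZE, BUFFERS)
SELECT run.loc_a_run_id AS run_id
     , COUNT(pline.loc_a_pline_id) AS total_plines
FROM loc_a_run AS run
JOIN loc_a_pline AS pline USING (loc_a_run_id)
WHERE run.has_errors IS false
GROUP BY run_id;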
Because this contains duplicated code, I tried to get rid of it by using a CTE (in the final view I need this for a few more calculations, but the pattern is the same):
WITH dist_pts AS (
   SELECT loc_a_run_id
        , CASE
             WHEN is_top IS true
             THEN ROUND(dist_right - dist_left, 2)
             ELSE ROUND(dist_left - dist_right, 2)
          END AS pts
   FROM loc_a_pline
   )
SELECT run.loc_a_run_id AS run_id
     , run_type.run_type
     , SUM(dist_pts.pts) AS total_dist_pts
     , COUNT(pline.loc_a_pline_id) AS total_plines
     , SUM(dist_pts.pts) / COUNT(pline.loc_a_pline_id) AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN dist_pts USING (loc_a_run_id)
JOIN loc_a_pline AS pline USING (loc_a_run_id)
JOIN loc_a_run_type AS run_type USING (run_type_id)
WHERE run.has_errors IS false
GROUP BY run_id, run_type;
Query plan:
"Finalize GroupAggregate (cost=575677889.59..575678266.13 rows=1365 width=108)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Gather Merge (cost=575677889.59..575678208.12 rows=2730 width=76)"
" Workers Planned: 2"
" -> Sort (cost=575676889.57..575676892.98 rows=1365 width=76)"
" Sort Key: run.loc_a_run_id, run_type.run_type"
" -> Partial HashAggregate (cost=575676801.43..575676818.49 rows=1365 width=76)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Parallel Hash Join (cost=155366.81..111024852.15 rows=23232597464 width=62)"
" Hash Cond: (loc_a_pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline (cost=0.00..107877.85 rows=1869785 width=22)"
" -> Parallel Hash (cost=120758.30..120758.30 rows=1625641 width=48)"
" -> Hash Join (cost=21.67..120758.30 rows=1625641 width=48)"
" Hash Cond: (run.run_type_id = run_type.run_type_id)"
" -> Hash Join (cost=20.55..112872.83 rows=1625641 width=18)"
" Hash Cond: (pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline pline (cost=0.00..107877.85 rows=1869785 width=12)"
" -> Hash (cost=17.14..17.14 rows=273 width=6)"
" -> Seq Scan on loc_a_run run (cost=0.00..17.14 rows=273 width=6)"
" Filter: (has_errors IS FALSE)"
" -> Hash (cost=1.05..1.05 rows=5 width=34)"
" -> Seq Scan on loc_a_run_type run_type (cost=0.00..1.05 rows=5 width=34)"
This takes forever and seems to be the wrong approach; the Parallel Hash Join now estimates rows=23232597464, so the joins apparently multiply the row count massively. I struggle to understand the query plan well enough to find my mistake(s).
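My suspicion is that joining both dist_pts and pline USING (loc_a_run_id) pairs every pline of a run with every CTE row of the same run. A probe like this (just a sketch against the schema above) would show the per-run row counts:

-- With both joins on loc_a_run_id alone, a run with n plines would
-- contribute n * n rows before aggregation, which would explain the
-- huge row estimate.
SELECT loc_a_run_id
     , count(*) AS plines_per_run
FROM loc_a_pline
GROUP BY loc_a_run_id
ORDER BY plines_per_run DESC
LIMIT 10;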
So my questions are:
- Why does the CTE approach take so much time?
- What would be the smartest solution to avoid duplicated code and ideally also reduce execution time?
- Is there a way to compute SUM(dist_pts.pts) only once?
- Is there a way to COUNT(pline.loc_a_pline_id) in the same go as the subtraction in the CTE, instead of accessing the big loc_a_pline table again? (Is it accessed again at all?) Roughly what I picture is sketched below.
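To illustrate those last two points: what I picture is the aggregation (including the count) pulled into the CTE, so that loc_a_pline is scanned only once. This is an untested sketch; I don't know whether it is correct or whether the planner handles it any better:

WITH dist_pts AS (
   SELECT loc_a_run_id
        , SUM(CASE
                 WHEN is_top IS true
                 THEN ROUND(dist_right - dist_left, 2)
                 ELSE ROUND(dist_left - dist_right, 2)
              END) AS total_pts
        , COUNT(*) AS total_plines  -- same as COUNT(loc_a_pline_id): the PK is NOT NULL
   FROM loc_a_pline
   GROUP BY loc_a_run_id
   )
SELECT run.loc_a_run_id AS run_id
     , run_type.run_type
     , dist_pts.total_pts AS total_dist_pts
     , dist_pts.total_plines
     , dist_pts.total_pts / dist_pts.total_plines AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN dist_pts USING (loc_a_run_id)
JOIN loc_a_run_type AS run_type USING (run_type_id)
WHERE run.has_errors IS false;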
Any help is highly appreciated!