Query Optimization for Large Dataset with Complex Join and Aggregation

Question

Table A: (columns: id (integer), name (varchar), ...)
Table B: (columns: id (integer), a_id (integer), value (numeric), ...)
Table C: (columns: id (integer), a_id (integer), b_id (integer), date (date), ...)

SELECT A.name, SUM(B.value) AS total_value,
    COUNT(DISTINCT C.date) AS distinct_dates
FROM A
JOIN B ON A.id = B.a_id
JOIN C ON B.id = C.b_id
WHERE C.date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY A.name
ORDER BY total_value DESC;

The query takes a considerable amount of time to execute, even when limited to a specific date range. Indexes are in place on the relevant columns.

Table A has approximately 10 million rows.
Table B has approximately 20 million rows.
Table C has approximately 50 million rows.

The query execution time increases exponentially with larger date ranges.

What are specific indexes, query rewrites, or configuration changes to improve the execution time?

What are alternative approaches for handling such a complex join and aggregation operation on a large dataset?

Could you please share the results from explain (analyze, verbose, buffers) for this statement (in plain text) and the ddl for the tables and indexes involved? — Frank Heikens, May 27 '23 at 17:17
Please ask 1 specific researched non-duplicate question. [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/3404097) [mre] [ask] [Help] — philipxy, May 29 '23 at 11:44
[Tips for asking a good SQL question](https://meta.stackoverflow.com/q/271055/3404097) [Asking query performance questions](https://dba.meta.stackexchange.com/q/3034/43932) [How to Optimize Queries in a Database - The Basics](https://stackoverflow.com/q/3191623/3404097) [What are your most common sql optimizations?](https://stackoverflow.com/q/1332778/3404097) [When and why are database joins expensive?](https://stackoverflow.com/q/173726/3404097) — philipxy, May 29 '23 at 11:45

score 0 · Answer 1 · edited May 28 '23 at 15:55

It's difficult to say without knowing the actual dataset and database configuration. Some general tips would be:

Make sure you have the right indexes: indexes can improve query performance by reducing the number of data to be scanned. In your case, you should have indexes on all columns used in join conditions, filters, and order by clauses. (A.id, B.a_id, B.id, C.b_id, and C.date).
Use EXPLAIN to analyze your query: it can help you understand how the query will be executed so you can identify potential bottlenecks.
Try rewriting the query: sometimes it helps to improve performance. Tou can try subqueries or CTEs to break down the query into smaller parts.
Consider partitioning your data: it may also help reduce the number of data to be scanned. But it depends on whether your data can be split into more manageable chunks or not.

Thorsten Kettner · Answer 2 · 2023-05-29T12:12:06.130

Indexes are in place on the relevant columns

Are they? To have this query run fast, you should have an index on C's date column and its a_id, so to get quickly from the WHERE clause to the first join.

create index idx_c on c (date) include (b_id);

Do you have this index?

If this is still too slow, you may want to consider providing covering indexes on the other tables:

create unique index idx_b on b (id) include (a_id, value);

create unique index idx_a on a (id) include (name);

In the end, this all still depends on the date range you provide and the decision the DBMS makes based on this. If your date range covers 90% of the C rows, then full tabke scans may be a lot faster than using indexes. If your date range covers 1% or even less of the C rows, then running through the indexes will be faster. We don't know the threshold when indexes become preferable, but the DBMS will do its best to make the best decision.

As table A is not really part of the aggregation, it may help to make this plainly visible to the DBMS. Ideally, the optimizer should see this is the same query and come up with the same execution plan, but optimizers are not perfect, so a hint may help every now and then.

SELECT a.name, bc.total_value, bc.distinct_dates
FROM 
(
  SELECT
    a_id,
    SUM(b.value) AS total_value,
    COUNT(DISTINCT C.date) AS distinct_dates
  FROM c
  JOIN b ON b.id = c.b_id
  WHERE c.date BETWEEN DATE '2022-01-01' AND DATE '2022-12-31'
  GROUP BY a_id
) bc
JOIN a on ON a.id = bc.a_id
ORDER BY bc.total_value DESC;

Query Optimization for Large Dataset with Complex Join and Aggregation

2 Answers2