I need advice about indexing on arrays. To be more specific, here is the query in question:
SELECT c.id,
d.operation_company_code,
d.create_datetime::date,
d.operation_category,
d.operation_type,
d.partner_name,
d.programme_name,
d.currency,
sum(d.value) as value,
sum(d.value_pln) as value_pln,
count(distinct d.external_id_transaction_id) as trans_count,
'DAILY_NEW'
FROM risk.transactions_operations d
LEFT JOIN risk.transactions_aggregated_for_test c ON ARRAY[d.transaction_id::integer] && c.agr_transactions AND c.aggregation_window = 'DAILY_NEW'
WHERE d.create_datetime >= date_trunc('month', date '2020-06-30') - interval '1 month' * 4 AND
d.create_datetime < date '2020-06-30' AND
d.operation_company_code in ('dotpay') AND
d.operation_category IS NOT NULL
GROUP BY 1,2,3,4,5,6,7,8
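(For reference, the date arithmetic in the WHERE clause resolves to a four-month window, from 2020-02-01 up to but not including 2020-06-30:)
SELECT date_trunc('month', date '2020-06-30') - interval '1 month' * 4;  -- 2020-02-01 00:00:00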
I have grouped ids in an array column called agr_transactions, which I want to join against transaction_id. I created the intarray extension and indexed agr_transactions with a GIN index:
CREATE INDEX indx_gist_agr_transactions_test ON risk.transactions_aggregated_for_test USING gin (agr_transactions gin__int_ops);
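For completeness, the intarray extension was installed beforehand:
CREATE EXTENSION IF NOT EXISTS intarray;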
Here is the EXPLAIN ANALYZE output:
"GroupAggregate (cost=15546386.98..17516756.04 rows=9817806 width=210) (actual time=163940.899..780747.071 rows=3179 loops=1)"
" Group Key: c.id, d.operation_company_code, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" -> Sort (cost=15546386.98..15696266.77 rows=59951915 width=140) (actual time=162462.432..182248.966 rows=76527789 loops=1)"
" Sort Key: c.id, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" Sort Method: external merge Disk: 7130224kB"
" -> Nested Loop Left Join (cost=1000.58..5280968.02 rows=59951915 width=140) (actual time=8488.946..58085.938 rows=76527789 loops=1)"
" Join Filter: (ARRAY[(d.transaction_id)::integer] && c.agr_transactions)"
" -> Index Scan using indx_transactions_operations_create_datetime on transactions_operations d (cost=0.57..3872640.28 rows=59951915 width=148) (actual time=1252.611..34759.716 rows=76527789 loops=1)"
" Index Cond: ((create_datetime >= (date_trunc('month'::text, ('2020-06-30'::date)::timestamp with time zone) - '4 mons'::interval)) AND (create_datetime < '2020-06-30'::date))"
" Filter: ((operation_category IS NOT NULL) AND ((operation_company_code)::text = 'dotpay'::text))"
" -> Materialize (cost=1000.00..209289.44 rows=1 width=36) (actual time=0.000..0.000 rows=0 loops=76527789)"
" -> Gather (cost=1000.00..209289.43 rows=1 width=36) (actual time=7236.301..7236.422 rows=0 loops=1)"
" Workers Planned: 4"
" Workers Launched: 4"
" -> Parallel Seq Scan on transactions_aggregated_for_test c (cost=0.00..208289.33 rows=1 width=36) (actual time=7198.879..7198.880 rows=0 loops=5)"
" Filter: ((aggregation_window)::text = 'DAILY_NEW'::text)"
"Planning Time: 15.375 ms"
"JIT:"
" Functions: 36"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 12.960 ms, Inlining 122.047 ms, Optimization 702.757 ms, Emission 422.523 ms, Total 1260.287 ms"
"Execution Time: 781417.543 ms"
As you can see, the index indx_gist_agr_transactions_test wasn't scanned. Instead, the planner chose a nested loop join with a per-row overlap filter.
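For what it's worth, this is the shape of standalone lookup I expected the planner to push into the join via the GIN index (12345 here is a made-up transaction id):
SELECT b.id
FROM risk.transactions_aggregated_for_test b
WHERE b.aggregation_window = 'DAILY_NEW'
  AND b.agr_transactions && ARRAY[12345];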
Previously I used unnest(), but I didn't know how to create an index on the unnested elements, and it seemed that using the native array operators could be faster.
Previous query:
SELECT c.id,
d.operation_company_code,
d.create_datetime::date,
d.operation_category,
d.operation_type,
d.partner_name,
d.programme_name,
d.currency,
sum(d.value) as value,
sum(d.value_pln) as value_pln,
count(distinct d.external_id_transaction_id) as trans_count,
'DAILY_NEW'
FROM risk.transactions_operations d
LEFT JOIN (
    SELECT b.id, b.aggregation_window, unnest(b.agr_transactions) AS transaction_id
    FROM risk.transactions_aggregated b
    WHERE b.aggregation_window = 'DAILY_NEW'
) c ON d.transaction_id = c.transaction_id
WHERE d.create_datetime >= date_trunc('month', date '2020-06-30') - interval '1 month' * 4 AND
d.create_datetime < date '2020-06-30' AND
d.operation_company_code in ('dotpay') AND
d.operation_category IS NOT NULL AND
(c.aggregation_window ='DAILY_NEW' OR c.aggregation_window IS NULL)
GROUP BY 1,2,3,4,5,6,7,8
EXPLAIN ANALYZE of the previous query:
"GroupAggregate (cost=4005251.52..4005251.57 rows=1 width=210) (actual time=50946.759..349958.518 rows=3179 loops=1)"
" Group Key: c.id, d.operation_company_code, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" -> Sort (cost=4005251.52..4005251.53 rows=1 width=140) (actual time=50309.153..70143.192 rows=76527789 loops=1)"
" Sort Key: c.id, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" Sort Method: external merge Disk: 7130224kB"
" -> Gather (cost=1299354.57..4005251.51 rows=1 width=140) (actual time=1488.253..11990.729 rows=76527789 loops=1)"
" Workers Planned: 4"
" Workers Launched: 4"
" -> Parallel Hash Left Join (cost=1298354.57..4004251.41 rows=1 width=140) (actual time=1469.309..10435.223 rows=15305558 loops=5)"
" Hash Cond: (d.transaction_id = c.transaction_id)"
" Filter: (((c.aggregation_window)::text = 'DAILY_NEW'::text) OR (c.aggregation_window IS NULL))"
" -> Parallel Seq Scan on transactions_operations d (cost=0.00..2649607.10 rows=15010588 width=148) (actual time=0.062..7383.722 rows=15305558 loops=5)"
" Filter: ((operation_category IS NOT NULL) AND (create_datetime < '2020-06-30'::date) AND ((operation_company_code)::text = 'dotpay'::text) AND (create_datetime >= (date_trunc('month'::text, ('2020-06-30'::date)::timestamp with time zone) - '4 mons'::interval)))"
" Rows Removed by Filter: 4331630"
" -> Parallel Hash (cost=1298354.55..1298354.55 rows=1 width=18) (actual time=1469.047..1469.048 rows=0 loops=5)"
" Buckets: 1024 Batches: 1 Memory Usage: 8kB"
" -> Subquery Scan on c (cost=0.00..1298354.55 rows=1 width=18) (actual time=1469.016..1469.017 rows=0 loops=5)"
" -> ProjectSet (cost=0.00..1298354.54 rows=10 width=18) (actual time=1469.016..1469.016 rows=0 loops=5)"
" -> Parallel Seq Scan on transactions_aggregated b (cost=0.00..1298354.49 rows=1 width=81) (actual time=1469.015..1469.015 rows=0 loops=5)"
" Filter: ((aggregation_window)::text = 'DAILY_NEW'::text)"
" Rows Removed by Filter: 2022815"
"Planning Time: 3.771 ms"
"JIT:"
" Functions: 115"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 14.328 ms, Inlining 182.781 ms, Optimization 1702.648 ms, Emission 1149.316 ms, Total 3049.073 ms"
"Execution Time: 350175.376 ms"
Does it make sense to put an index on an array column? What can I do to speed up this query?
UPDATE: I've added some more information.
Postgres version: PostgreSQL 12.8 on x86_64-pc-linux-gnu
Other indexes:
CREATE INDEX indx_transactions_operations_batch_id ON risk.transactions_operations (batch_id);
CREATE UNIQUE INDEX indx_transactions_operations_external_id ON risk.transactions_operations (external_id);
CREATE INDEX indx_transactions_operations_transaction_id ON risk.transactions_operations (transaction_id);
CREATE INDEX indx_transactions_operations_create_datetime ON risk.transactions_operations (create_datetime);
CREATE INDEX indx_transactions_operations_currency ON risk.transactions_operations (currency);
In general, these indexes are used by other queries.
EXPLAIN (ANALYZE, BUFFERS):
"GroupAggregate (cost=15546386.98..17516756.04 rows=9817806 width=210) (actual time=162589.361..794709.012 rows=3179 loops=1)"
" Group Key: c.id, d.operation_company_code, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" Buffers: shared hit=2663905, temp read=891278 written=891283"
" -> Sort (cost=15546386.98..15696266.77 rows=59951915 width=140) (actual time=161073.298..181563.554 rows=76527789 loops=1)"
" Sort Key: c.id, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" Sort Method: external merge Disk: 7130224kB"
" Buffers: shared hit=2663905, temp read=891278 written=891283"
" -> Nested Loop Left Join (cost=1000.58..5280968.02 rows=59951915 width=140) (actual time=1562.304..52895.826 rows=76527789 loops=1)"
" Join Filter: (ARRAY[(d.transaction_id)::integer] && c.agr_transactions)"
" Buffers: shared hit=2663896"
" -> Index Scan using indx_transactions_operations_create_datetime on transactions_operations d (cost=0.57..3872640.28 rows=59951915 width=148) (actual time=1258.294..33932.598 rows=76527789 loops=1)"
" Index Cond: ((create_datetime >= (date_trunc('month'::text, ('2020-06-30'::date)::timestamp with time zone) - '4 mons'::interval)) AND (create_datetime < '2020-06-30'::date))"
" Filter: ((operation_category IS NOT NULL) AND ((operation_company_code)::text = 'dotpay'::text))"
" Buffers: shared hit=2466699"
" -> Materialize (cost=1000.00..209289.44 rows=1 width=36) (actual time=0.000..0.000 rows=0 loops=76527789)"
" Buffers: shared hit=197197"
" -> Gather (cost=1000.00..209289.43 rows=1 width=36) (actual time=303.983..304.089 rows=0 loops=1)"
" Workers Planned: 4"
" Workers Launched: 4"
" Buffers: shared hit=197197"
" -> Parallel Seq Scan on transactions_aggregated_for_test c (cost=0.00..208289.33 rows=1 width=36) (actual time=265.796..265.797 rows=0 loops=5)"
" Filter: ((aggregation_window)::text = 'DAILY_NEW'::text)"
" Buffers: shared hit=197197"
"Planning Time: 2.362 ms"
"JIT:"
" Functions: 36"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 13.116 ms, Inlining 105.580 ms, Optimization 717.825 ms, Emission 433.348 ms, Total 1269.870 ms"
"Execution Time: 795342.487 ms"
Number of rows in transactions_operations: 98,185,940
Number of rows in transactions_aggregated: 10,114,076
I want to optimize this, because the data in these tables will grow about 5x in the future.
Let me know if I've missed anything that I should add.