I need advice about indexing on arrays. To be more specific, here is the query in question:
SELECT c.id,
d.operation_company_code,
d.create_datetime::date,
d.operation_category,
d.operation_type,
d.partner_name,
d.programme_name,
d.currency,
sum(d.value) as value,
sum(d.value_pln) as value_pln,
count(distinct d.external_id_transaction_id) as trans_count,
'DAILY_NEW'
FROM risk.transactions_operations d
LEFT JOIN risk.transactions_aggregated_for_test c ON ARRAY[d.transaction_id::integer] && c.agr_transactions AND c.aggregation_window = 'DAILY_NEW'
WHERE d.create_datetime >= date_trunc('month', date '2020-06-30') - interval '1 month' * 4 AND
d.create_datetime < date '2020-06-30' AND
d.operation_company_code in ('dotpay') AND
d.operation_category IS NOT NULL
GROUP BY 1,2,3,4,5,6,7,8
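(For reference, the date arithmetic in the WHERE clause resolves to a four-month window, from 2020-02-01 up to but not including 2020-06-30:)
SELECT date_trunc('month', date '2020-06-30') - interval '1 month' * 4;  -- 2020-02-01 00:00:00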
I have grouped ids in an array column called agr_transactions, which I want to join against transaction_id. I created the intarray extension and indexed agr_transactions with a GIN index:
CREATE INDEX indx_gist_agr_transactions_test ON risk.transactions_aggregated_for_test USING gin (agr_transactions gin__int_ops);
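For completeness, the intarray extension was installed beforehand:
CREATE EXTENSION IF NOT EXISTS intarray;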
Here is the EXPLAIN ANALYZE output:
"GroupAggregate (cost=15546386.98..17516756.04 rows=9817806 width=210) (actual time=163940.899..780747.071 rows=3179 loops=1)"
" Group Key: c.id, d.operation_company_code, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" -> Sort (cost=15546386.98..15696266.77 rows=59951915 width=140) (actual time=162462.432..182248.966 rows=76527789 loops=1)"
" Sort Key: c.id, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" Sort Method: external merge Disk: 7130224kB"
" -> Nested Loop Left Join (cost=1000.58..5280968.02 rows=59951915 width=140) (actual time=8488.946..58085.938 rows=76527789 loops=1)"
" Join Filter: (ARRAY[(d.transaction_id)::integer] && c.agr_transactions)"
" -> Index Scan using indx_transactions_operations_create_datetime on transactions_operations d (cost=0.57..3872640.28 rows=59951915 width=148) (actual time=1252.611..34759.716 rows=76527789 loops=1)"
" Index Cond: ((create_datetime >= (date_trunc('month'::text, ('2020-06-30'::date)::timestamp with time zone) - '4 mons'::interval)) AND (create_datetime < '2020-06-30'::date))"
" Filter: ((operation_category IS NOT NULL) AND ((operation_company_code)::text = 'dotpay'::text))"
" -> Materialize (cost=1000.00..209289.44 rows=1 width=36) (actual time=0.000..0.000 rows=0 loops=76527789)"
" -> Gather (cost=1000.00..209289.43 rows=1 width=36) (actual time=7236.301..7236.422 rows=0 loops=1)"
" Workers Planned: 4"
" Workers Launched: 4"
" -> Parallel Seq Scan on transactions_aggregated_for_test c (cost=0.00..208289.33 rows=1 width=36) (actual time=7198.879..7198.880 rows=0 loops=5)"
" Filter: ((aggregation_window)::text = 'DAILY_NEW'::text)"
"Planning Time: 15.375 ms"
"JIT:"
" Functions: 36"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 12.960 ms, Inlining 122.047 ms, Optimization 702.757 ms, Emission 422.523 ms, Total 1260.287 ms"
"Execution Time: 781417.543 ms"
As you can see, the index indx_gist_agr_transactions_test wasn't scanned. Instead, the planner chose a nested loop join with a per-row overlap filter.
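For what it's worth, this is the shape of standalone lookup I expected the planner to push into the join via the GIN index (12345 here is a made-up transaction id):
SELECT b.id
FROM risk.transactions_aggregated_for_test b
WHERE b.aggregation_window = 'DAILY_NEW'
  AND b.agr_transactions && ARRAY[12345];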
Previously I used unnest(), but I didn't know how to create an index on the unnested elements, and it seemed that using the native array operators could be faster.
Previous query:
SELECT c.id,
d.operation_company_code,
d.create_datetime::date,
d.operation_category,
d.operation_type,
d.partner_name,
d.programme_name,
d.currency,
sum(d.value) as value,
sum(d.value_pln) as value_pln,
count(distinct d.external_id_transaction_id) as trans_count,
'DAILY_NEW'
FROM risk.transactions_operations d
LEFT JOIN (
    SELECT b.id, b.aggregation_window, unnest(b.agr_transactions) AS transaction_id
    FROM risk.transactions_aggregated b
    WHERE b.aggregation_window = 'DAILY_NEW'
) c ON d.transaction_id = c.transaction_id
WHERE d.create_datetime >= date_trunc('month', date '2020-06-30') - interval '1 month' * 4 AND
d.create_datetime < date '2020-06-30' AND
d.operation_company_code in ('dotpay') AND
d.operation_category IS NOT NULL AND
(c.aggregation_window ='DAILY_NEW' OR c.aggregation_window IS NULL)
GROUP BY 1,2,3,4,5,6,7,8
EXPLAIN ANALYZE of the previous query:
"GroupAggregate (cost=4005251.52..4005251.57 rows=1 width=210) (actual time=50946.759..349958.518 rows=3179 loops=1)"
" Group Key: c.id, d.operation_company_code, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" -> Sort (cost=4005251.52..4005251.53 rows=1 width=140) (actual time=50309.153..70143.192 rows=76527789 loops=1)"
" Sort Key: c.id, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" Sort Method: external merge Disk: 7130224kB"
" -> Gather (cost=1299354.57..4005251.51 rows=1 width=140) (actual time=1488.253..11990.729 rows=76527789 loops=1)"
" Workers Planned: 4"
" Workers Launched: 4"
" -> Parallel Hash Left Join (cost=1298354.57..4004251.41 rows=1 width=140) (actual time=1469.309..10435.223 rows=15305558 loops=5)"
" Hash Cond: (d.transaction_id = c.transaction_id)"
" Filter: (((c.aggregation_window)::text = 'DAILY_NEW'::text) OR (c.aggregation_window IS NULL))"
" -> Parallel Seq Scan on transactions_operations d (cost=0.00..2649607.10 rows=15010588 width=148) (actual time=0.062..7383.722 rows=15305558 loops=5)"
" Filter: ((operation_category IS NOT NULL) AND (create_datetime < '2020-06-30'::date) AND ((operation_company_code)::text = 'dotpay'::text) AND (create_datetime >= (date_trunc('month'::text, ('2020-06-30'::date)::timestamp with time zone) - '4 mons'::interval)))"
" Rows Removed by Filter: 4331630"
" -> Parallel Hash (cost=1298354.55..1298354.55 rows=1 width=18) (actual time=1469.047..1469.048 rows=0 loops=5)"
" Buckets: 1024 Batches: 1 Memory Usage: 8kB"
" -> Subquery Scan on c (cost=0.00..1298354.55 rows=1 width=18) (actual time=1469.016..1469.017 rows=0 loops=5)"
" -> ProjectSet (cost=0.00..1298354.54 rows=10 width=18) (actual time=1469.016..1469.016 rows=0 loops=5)"
" -> Parallel Seq Scan on transactions_aggregated b (cost=0.00..1298354.49 rows=1 width=81) (actual time=1469.015..1469.015 rows=0 loops=5)"
" Filter: ((aggregation_window)::text = 'DAILY_NEW'::text)"
" Rows Removed by Filter: 2022815"
"Planning Time: 3.771 ms"
"JIT:"
" Functions: 115"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 14.328 ms, Inlining 182.781 ms, Optimization 1702.648 ms, Emission 1149.316 ms, Total 3049.073 ms"
"Execution Time: 350175.376 ms"
Does it make sense to put an index on an array column? What can I do to speed up this query?
UPDATE: I've added some more information.
Postgres version: PostgreSQL 12.8 on x86_64-pc-linux-gnu
Other indexes:
CREATE INDEX indx_transactions_operations_batch_id ON risk.transactions_operations (batch_id);
CREATE UNIQUE INDEX indx_transactions_operations_external_id ON risk.transactions_operations (external_id);
CREATE INDEX indx_transactions_operations_transaction_id ON risk.transactions_operations (transaction_id);
CREATE INDEX indx_transactions_operations_create_datetime ON risk.transactions_operations (create_datetime);
CREATE INDEX indx_transactions_operations_currency ON risk.transactions_operations (currency);
In general, these indexes are used by other queries.
EXPLAIN (ANALYZE, BUFFERS):
"GroupAggregate (cost=15546386.98..17516756.04 rows=9817806 width=210) (actual time=162589.361..794709.012 rows=3179 loops=1)"
" Group Key: c.id, d.operation_company_code, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" Buffers: shared hit=2663905, temp read=891278 written=891283"
" -> Sort (cost=15546386.98..15696266.77 rows=59951915 width=140) (actual time=161073.298..181563.554 rows=76527789 loops=1)"
" Sort Key: c.id, ((d.create_datetime)::date), d.operation_category, d.operation_type, d.partner_name, d.programme_name, d.currency"
" Sort Method: external merge Disk: 7130224kB"
" Buffers: shared hit=2663905, temp read=891278 written=891283"
" -> Nested Loop Left Join (cost=1000.58..5280968.02 rows=59951915 width=140) (actual time=1562.304..52895.826 rows=76527789 loops=1)"
" Join Filter: (ARRAY[(d.transaction_id)::integer] && c.agr_transactions)"
" Buffers: shared hit=2663896"
" -> Index Scan using indx_transactions_operations_create_datetime on transactions_operations d (cost=0.57..3872640.28 rows=59951915 width=148) (actual time=1258.294..33932.598 rows=76527789 loops=1)"
" Index Cond: ((create_datetime >= (date_trunc('month'::text, ('2020-06-30'::date)::timestamp with time zone) - '4 mons'::interval)) AND (create_datetime < '2020-06-30'::date))"
" Filter: ((operation_category IS NOT NULL) AND ((operation_company_code)::text = 'dotpay'::text))"
" Buffers: shared hit=2466699"
" -> Materialize (cost=1000.00..209289.44 rows=1 width=36) (actual time=0.000..0.000 rows=0 loops=76527789)"
" Buffers: shared hit=197197"
" -> Gather (cost=1000.00..209289.43 rows=1 width=36) (actual time=303.983..304.089 rows=0 loops=1)"
" Workers Planned: 4"
" Workers Launched: 4"
" Buffers: shared hit=197197"
" -> Parallel Seq Scan on transactions_aggregated_for_test c (cost=0.00..208289.33 rows=1 width=36) (actual time=265.796..265.797 rows=0 loops=5)"
" Filter: ((aggregation_window)::text = 'DAILY_NEW'::text)"
" Buffers: shared hit=197197"
"Planning Time: 2.362 ms"
"JIT:"
" Functions: 36"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 13.116 ms, Inlining 105.580 ms, Optimization 717.825 ms, Emission 433.348 ms, Total 1269.870 ms"
"Execution Time: 795342.487 ms"
Number of rows in transactions_operations: 98,185,940
Number of rows in transactions_aggregated: 10,114,076
I want to optimize this, because the data in these tables will grow about 5x in the future.
Let me know if I've missed anything that I should add.