postgresql improve the query scan/filtering

Question

I have the following table for attributes of different objects

create table attributes(id serial primary key,
object_id int, 
attribute_id text,
text_data text,
int_data int,
timestamp_data timestamp,
state text default 'active');

an object will have different type of attributes and attribute value will be in one column among text_data or int_data or timestamp_data , depending on attribute data type.

sample data is here

I want to retrieve the records, my query is

select * from attributes
where attribute_id = 55 and state='active' 
order by text_data

which is very slow.

increased the work_mem to 1 GB for current session. using set command SET work_mem TO '1 GB'; to improve the sort method from external merge Disk to quicksort

But no improvement in query execution. Query executed plan is

Gather Merge  (cost=750930.58..1047136.19 rows=2538728 width=128) (actual time=18272.405..27347.556 rows=3462116 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=235635 read=201793
  ->  Sort  (cost=749930.56..753103.97 rows=1269364 width=128) (actual time=14299.222..15494.774 rows=1154039 loops=3)
        Sort Key: text_data
        Sort Method: quicksort  Memory: 184527kB
        Worker 0:  Sort Method: quicksort  Memory: 266849kB
        Worker 1:  Sort Method: quicksort  Memory: 217050kB
        Buffers: shared hit=235635 read=201793
        ->  Parallel Seq Scan on attributes   (cost=0.00..621244.50 rows=1269364 width=128) (actual time=0.083..3410.570 rows=1154039 loops=3)
              Filter: ((attribute_id = 185) AND (state = 'active'))
              Rows Removed by Filter: 8652494
              Buffers: shared hit=235579 read=201793
Planning Time: 0.453 ms
Execution Time: 29135.285 ms

the query total runtime in 45 sec

Successfully run. Total query runtime: 45 secs 237 msec.
3462116 rows affected.

To improve filtering and query execution time, created index on attribute_id and state create index attribute_id_state on attributes(attribute_id,state);

Sort  (cost=875797.49..883413.68 rows=3046474 width=128) (actual time=47189.534..49035.361 rows=3462116 loops=1)
  Sort Key: text_data
  Sort Method: quicksort  Memory: 643849kB
  Buffers: shared read=406048
  ->  Bitmap Heap Scan on attributes   (cost=64642.80..547711.91 rows=3046474 width=128) (actual time=981.857..10348.441 rows=3462116 loops=1)
        Recheck Cond: ((attribute_id = 185) AND (state = 'active'))
        Heap Blocks: exact=396586
        Buffers: shared read=406048
        ->  Bitmap Index Scan on attribute_id_state  (cost=0.00..63881.18 rows=3046474 width=0) (actual time=751.909..751.909 rows=3462116 loops=1)
              Index Cond: ((attribute_id = 185) AND (state = 'active'))
              Buffers: shared read=9462
Planning Time: 0.358 ms
Execution Time: 50388.619 ms

but query become very slow after creating index.

Table has 29.5 Million rows. text_data is null in 9 Million rows. Query is returning almost 3 million records, which is 10% of table.

Is there any other index or the other way like changing parameter etc to improve the query ?

Add text_data as last part of index. `create index attribute_id_state_text_data on attributes(attribute_id,state,text_data)` . Additionally select is incorect because attribute_id has text type and 55 is integer — Kadet, Jun 04 '22 at 13:22
Why are you doing this at all? What are you going to do with 10% of the table once you have it? — jjanes, Jun 04 '22 at 13:23
second option (if there is sometimes no conditions in where, only order) is to build index only on `text_data` field to avoid sorting during select — Kadet, Jun 04 '22 at 13:28

score 2 · Answer 1 · answered Jun 04 '22 at 13:58

Some suggestions:

ORDER BY clauses can be accelerated by indexes. So if you put your ordering column in your compound index you may get things to go much faster.
```
CREATE INDEX attribute_id_state_data
     ON attributes(attribute_id, state, text_data);
```
This index is redundant with the one in your question, so drop that one when you create this one.
You use SELECT *, a notorious performance and maintainability antipattern. You're much better off naming the columns you want. This is especially important when your result sets are large: why waste CPU and network resources on data you don't need in your application? So, let's assume you want to do this. If you don't need all those columns, remove some of them from this SELECT.
```
SELECT id, object_id, attribute_id, text_data, int_data,
       timestamp_data, state ...
```
You can use the INCLUDE clause on your index so it covers your query, that is so the query can be satisfied entirely from the index.
```
CREATE INDEX attribute_id_state_data
     ON attributes(attribute_id, state, text_data)
     INCLUDE (id, object_id, int_data, timestamp_data, state)
```
When you use this BTREE index, your query is satisfied by random-accessing the index to the first eligible row and then scanning the index sequentially. There's no need for PostgreSQL to bounce back to the table's data. It doesn't get much faster than that for a big result set.

If you remove some columns from your SELECT clause, you can also remove them from the index's INCLUDE clause.
You ORDER BY a large-object TEXT column. That's a lot of data to sort in each record, whether during index creation or a query. It's stored out-of-line, so it's not as fast. Can you rework your application to use a limited-length VARCHAR column for this instead? It will be more efficient.

Few articles said there is no difference in terms of performance between varchar and text. https://stackoverflow.com/questions/4848964/difference-between-text-and-varchar-character-varying https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-char-varchar-text/#:~:text=The%20performance%20of%20the%20VARCHAR,the%20VARCHAR(n)%20column. — Hari, Jun 05 '22 at 12:19
Created index `CREATE INDEX attribute_id_state_data ON attributes(attribute_id, state, text_data);` but planner didn't chose the index. To learn how index scan performs , I had `SET enable_seqscan = OFF;`. but no improvement in performance. I didn't try INCLUDE as all columns are required — Hari, Jun 06 '22 at 15:07

postgresql improve the query scan/filtering

1 Answers1