3

If I execute a query like this:

SELECT eid
FROM   entidades e
WHERE  distrito IN ( SELECT id FROM distritos WHERE distrito_t LIKE '%lisboa%' )

or

SELECT eid
FROM   entidades e
WHERE  concelho IN ( SELECT id FROM concelho WHERE concelho_t LIKE '%lisboa%' )

my indexes on distrito or concelho are being used.

For any of the queries stated above, the output of explain analyze will be something like this:

----------------------------------------------------------------------
Nested Loop  (cost=239.36..23453.18 rows=12605 width=4) (actual time=29.995..790.191 rows=100602 loops=1)
  ->  HashAggregate  (cost=1.38..1.39 rows=1 width=12) (actual time=0.081..0.085 rows=1 loops=1)
        ->  Seq Scan on distritos  (cost=0.00..1.38 rows=1 width=12) (actual time=0.058..0.068 rows=1 loops=1)
              Filter: ((distrito_t)::text ~~ '%lisboa%'::text)
  ->  Bitmap Heap Scan on entidades e  (cost=237.98..23294.23 rows=12605 width=7) (actual time=29.892..389.767 rows=100602 loops=1)
        Recheck Cond: (e.distrito = distritos.id)
        ->  Bitmap Index Scan on idx_t_ent_dis  (cost=0.00..234.83 rows=12605 width=0) (actual time=26.787..26.787 rows=100602 loops=1)
              Index Cond: (e.distrito = distritos.id)

However, for the following query, indexes are not used at all...

SELECT eid
FROM   entidades e
WHERE  concelho IN ( SELECT id FROM concelho WHERE concelho_t LIKE '%lisboa%' )
OR     distrito IN ( SELECT id FROM distritos WHERE distrito_t LIKE '%lisboa%' )

----------------------------------------------------------------------
Seq Scan on entidades e  (cost=10.25..34862.71 rows=283623 width=4) (actual time=0.600..761.876 rows=100604 loops=1)
  Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
  SubPlan 1
    ->  Seq Scan on distritos  (cost=0.00..1.38 rows=1 width=12) (actual time=0.083..0.093 rows=1 loops=1)
          Filter: ((distrito_t)::text ~~ '%lisboa%'::text)
  SubPlan 2
    ->  Seq Scan on concelhos  (cost=0.00..8.86 rows=3 width=5) (actual time=0.173..0.258 rows=1 loops=1)
          Filter: ((concelho_t)::text ~~ '%lisboa%'::text)

How can I create an index that will be used by the previous query?
According to this documentation it is possible...
But I'm probably not searching for the right thing since I can't find any example at all...

     

update: added explain output for both query types...

acm
  • 6,541
  • 3
  • 39
  • 44
  • How many rows does the table entidades contain? –  Mar 04 '11 at 20:50
  • @a_horse_with_no_name, > 350k – acm Mar 07 '11 at 11:41
  • 2
    so you are selecting about a third of all rows. In that case a sequential scan is faster than an index scan. Check out this post: http://stackoverflow.com/questions/5203755/why-does-postgresql-perform-sequential-scan-on-indexed-column/5203827#5203827 –  Mar 07 '11 at 12:39
  • @a_horse_with_no_name, great info. How would you approach this problem having that in mind? I have reasons to believe that this kind of queries that select 1/3 of total rows ARE going to be used regularly... – acm Mar 07 '11 at 12:45
  • live with the sequential scan - no clever indexing is going to make things magically faster –  Mar 07 '11 at 14:48

3 Answers3

1

I don't have an answer for you, only a question, why use a subquery ?

SELECT eid
FROM   entidades e
LEFT JOIN concelho c ON e.concelho = c.id
LEFT JOIN distritos d ON e.distrito = d.id
WHERE  
    concelho_t LIKE '%lisboa%' OR 
    distrito_t LIKE '%lisboa%';
krtek
  • 26,334
  • 5
  • 56
  • 84
  • Because this is just a part of a much more complex problem and it makes sense in my context. (tested that way several times with many variants and performance was far worse) Thanks anyways. – acm Mar 04 '11 at 14:57
  • The documentation you posted is only information about postgres, there's nothing you can do yourself to improve this. Maybe if you post your entire query, someone will be able to help, but I don't think you can do anything about your indexes to improve something without rewriting the query. – krtek Mar 04 '11 at 15:00
  • I believe it's relevant because that piece of documentation states it's possible to have multiple indexes (also, I have used it for `WHERE ... AND ...` conditions ) and states it can also be used to `WHERE ... OR ...` conditions. However can't find any example for that situations and I'm not sure how to implement. Also, you're query will not use any index at all for the same reason mines don't. – acm Mar 04 '11 at 15:05
1

The documentation you link to states: "the planner will sometimes choose to use a simple index scan even though additional indexes are available that could have been used as well".

I wasn't able to get postgres (8.4) to behave the way you want when the where condition contained subqueries - only with simple conditions, so it might just be a limitation of the feature.

But the point really is that although the optimizer will try to choose the fastest execution path, it might not succeed, and in some cases you might need to 'encourage' another path, as I've done in the 'modified' query below with a union:

create table distritos(id serial primary key, distrito_t text);
insert into distritos(distrito_t) select 'distrito'||generate_series(1, 10000);

create table concelho(id serial primary key, concelho_t text);
insert into concelho(concelho_t) select 'concelho'||generate_series(1, 10000);

create table entidades( eid serial primary key, 
                        distrito integer not null references distritos, 
                        concelho integer not null references concelho );
insert into entidades(distrito, concelho)
select generate_series(1, 10000), generate_series(1, 10000);

original:

explain analyze 
select eid from entidades
where concelho in (select id from concelho where concelho_t like '%lisboa%')
   or distrito in (select id from distritos where distrito_t like '%lisboa%');

                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on entidades  (cost=299.44..494.94 rows=7275 width=4) (actual time=8.978..8.978 rows=0 loops=1)
   Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
   SubPlan 1
     ->  Seq Scan on concelho  (cost=0.00..149.71 rows=2 width=4) (actual time=3.922..3.922 rows=0 loops=1)
           Filter: (concelho_t ~~ '%lisboa%'::text)
   SubPlan 2
     ->  Seq Scan on distritos  (cost=0.00..149.71 rows=2 width=4) (actual time=3.363..3.363 rows=0 loops=1)
           Filter: (distrito_t ~~ '%lisboa%'::text)

modified:

explain analyze
  select eid from entidades 
  where concelho in (select id from concelho where concelho_t like '%lisboa%')
  union
  select eid from entidades
  where distrito in (select id from distritos where distrito_t like '%lisboa%');

                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=648.98..650.92 rows=194 width=4) (actual time=5.409..5.409 rows=0 loops=1)
   ->  Append  (cost=149.74..648.50 rows=194 width=4) (actual time=5.399..5.399 rows=0 loops=1)
         ->  Hash Semi Join  (cost=149.74..323.28 rows=97 width=4) (actual time=2.743..2.743 rows=0 loops=1)
               Hash Cond: (stack.entidades.concelho = concelho.id)
               ->  Seq Scan on entidades  (cost=0.00..147.00 rows=9700 width=8) (actual time=0.013..0.013 rows=1 loops=1)
               ->  Hash  (cost=149.71..149.71 rows=2 width=4) (actual time=2.723..2.723 rows=0 loops=1)
                     ->  Seq Scan on concelho  (cost=0.00..149.71 rows=2 width=4) (actual time=2.716..2.716 rows=0 loops=1)
                           Filter: (concelho_t ~~ '%lisboa%'::text)
         ->  Hash Semi Join  (cost=149.74..323.28 rows=97 width=4) (actual time=2.655..2.655 rows=0 loops=1)
               Hash Cond: (stack.entidades.distrito = distritos.id)
               ->  Seq Scan on entidades  (cost=0.00..147.00 rows=9700 width=8) (actual time=0.006..0.006 rows=1 loops=1)
               ->  Hash  (cost=149.71..149.71 rows=2 width=4) (actual time=2.642..2.642 rows=0 loops=1)
                     ->  Seq Scan on distritos  (cost=0.00..149.71 rows=2 width=4) (actual time=2.642..2.642 rows=0 loops=1)
                           Filter: (distrito_t ~~ '%lisboa%'::text)
  • Thanks Jack, I was aware of this solution. It uses the indexes as expected however, thanks to the `append` this query is actually slower when the result set is big (in my case around 100k rows), and a lot faster when it's small, that much is true!!! – acm Mar 07 '11 at 12:04
  • If that is so then it sounds like postgres is making the right choice after all :) - using indexes is not always faster than scanning entire tables. The deciding factor will be related to what percentage of rows match the query rather than how many. I don't think the `append` will be improved on - all it is doing is combining the two result sets (eliminating duplicates, so it requires a sort). If the set of matches is so big that those sorts are slowing you down then the full scan with filtering is going to be faster full stop. –  Mar 07 '11 at 14:27
  • depending what you are doing with the result set you may be able to get away with a `union all` rather than a `union` which will certainly be faster - but you have to be able to handle duplicates in the set. –  Mar 07 '11 at 14:29
1

Try:

SELECT eid
FROM   entidades e
WHERE  concelho IN ( SELECT id FROM concelho WHERE concelho_t LIKE '%lisboa%' )
UNION
SELECT eid
FROM   entidades e
OR     distrito IN ( SELECT id FROM distritos WHERE distrito_t LIKE '%lisboa%' )

But the real problem is lack of Normalisation in the database, the absence of a hierarchy for Country, Territory, county (condado, parroquia), town, suburb. If you had that, then organisation could belong to the structure in the right place, and you would not have "Lisboa" occurring at both council at district levels.

PerformanceDBA
  • 32,198
  • 10
  • 64
  • 90
  • That's a nice assumption PerformanceDBA, however `lisboa` is just an example. I could be searching anything else like `lis`. How can you be sure `lis` will only occur on council or district or something else? Do you still believe it's a normalization issue? – acm Mar 07 '11 at 11:45
  • Also, as a side note: in Portugal, Lisboa is both a District (distrito) and a Council (concelho). :-) – acm Mar 07 '11 at 11:59
  • @andre. 1) I can only go by the info provided, and I am happy to correct my post when the info changes. I was not suggesting certainty about anything. 2) Normalisation: yes, I see this (kind of code construct to work around normalisation errors) and correct it all the time, with massive improvement in performance. Post your DDL, and I can speak to that 3) Have you tried the code ? – PerformanceDBA Mar 07 '11 at 23:12