There are lots of questions on this topic, but all of them seem to involve more complex cases than the one I'm looking at, and the answers don't seem applicable.

OHDSI=> \d record_counts
               Table "results2.record_counts"
         Column         |         Type          | Modifiers
------------------------+-----------------------+-----------
 concept_id             | integer               |
 schema                 | text                  |
 table_name             | text                  |
 column_name            | text                  |
 column_type            | text                  |
 descendant_concept_ids | bigint                |
 rc                     | numeric               |
 drc                    | numeric               |
 domain_id              | character varying(20) |
 vocabulary_id          | character varying(20) |
 concept_class_id       | character varying(20) |
 standard_concept       | character varying(1)  |
Indexes:
    "rc_dom" btree (domain_id, concept_id)
    "rcdom" btree (domain_id)
    "rcdomvocsc" btree (domain_id, vocabulary_id, standard_concept)

The table has 3,133,778 records, so Postgres shouldn't be ignoring the index because of small table size.
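(As a sanity check, the planner's own view of the table size can be read straight out of pg_class; note that reltuples is an estimate maintained by ANALYZE/autovacuum, not an exact count:)

-- Planner's size estimate for the table (refreshed by ANALYZE/autovacuum):
SELECT relname, reltuples::bigint AS estimated_rows, relpages
FROM pg_class
WHERE relname = 'record_counts';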

I filter on domain_id, which is indexed, and the index is ignored:

OHDSI=> explain select * from record_counts where domain_id = 'Drug';
                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on record_counts  (cost=0.00..76744.81 rows=2079187 width=87)
   Filter: ((domain_id)::text = 'Drug'::text)

I turn off seqscan and:

OHDSI=> set enable_seqscan=false;
SET
OHDSI=> explain select * from record_counts where domain_id = 'Drug';
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on record_counts  (cost=42042.13..105605.97 rows=2079187 width=87)
   Recheck Cond: ((domain_id)::text = 'Drug'::text)
   ->  Bitmap Index Scan on rcdom  (cost=0.00..41522.33 rows=2079187 width=0)
         Index Cond: ((domain_id)::text = 'Drug'::text)

Indeed, the plan says it's going to be more expensive to use the index than not, but why? If the index lets it handle many fewer records, shouldn't it be quicker to use it?
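(One way to sanity-check the planner's choice — a sketch, and note that EXPLAIN ANALYZE actually executes the query — is to time both plans and compare:)

-- Run the query for real, reporting actual times and buffer I/O:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM record_counts WHERE domain_id = 'Drug';

-- Force the index path and compare wall-clock time:
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM record_counts WHERE domain_id = 'Drug';
RESET enable_seqscan;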

  • Are your statistics up to date? – Mike Sherrill 'Cat Recall' Jan 23 '17 at 12:58
  • BTW: Doesn't the table have a primary key? – joop Jan 23 '17 at 13:02
  • @MikeSherrill'CatRecall': I ran `analyze record_counts`. Is there anything else I need to do? – Sigfried Jan 23 '17 at 13:13
  • @joop: It could have one on (concept_id, schema, table_name, column_name). Would that make a difference? – Sigfried Jan 23 '17 at 13:14
  • 1
    Yes, it makes a lot of difference. Not for the performance per se, but for the value of your data model. (In short: a table without (at least) a unique constraint basically has no meaning) – joop Jan 23 '17 at 13:21
  • @joop: What I'm working on is all read-only tables calculated off already existing tables. I'm just preprocessing stuff to make API access faster. In that case do you still think a primary key is needed? ... Well, to answer my own question, I guess if there's a chance I could end up with duplicate rows, that would be a big problem. Ok, point taken. Thanks. – Sigfried Jan 23 '17 at 13:29
  • Remember: a database table is *not* a spreadsheet with some indexes glued on. – joop Jan 23 '17 at 13:44
  • No, `analyze record_counts` is all you need to do at the moment. Read the "info" link on the postgresql-performance tag I added. Good education for the future. – Mike Sherrill 'Cat Recall' Jan 23 '17 at 13:54
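Following up on the comments, a sketch of how to inspect what ANALYZE actually recorded for the filtered column (pg_stats is a standard system view), plus the candidate primary key mentioned in the thread:

-- What the planner knows about domain_id after ANALYZE:
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'record_counts' AND attname = 'domain_id';

-- Candidate primary key from the comment thread ("schema" quoted since it's also a keyword):
ALTER TABLE record_counts
  ADD PRIMARY KEY (concept_id, "schema", table_name, column_name);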

1 Answer


Ok, it looks like Postgres knew what it was doing. The particular value of the indexed column I was using ('Drug') happens to account for 66% of the rows in the table. So, yes, the filter makes the row set significantly smaller, but since those rows are scattered across most of the table's pages, the index doesn't allow them to be retrieved any faster.

OHDSI=> select domain_id, count(*) as rows, round((100 * count(*)::float / 3133778.0)::numeric,4) pct from record_counts group by 1 order by 2 desc;
      domain_id      |  rows   |   pct
---------------------+---------+---------
 Drug                | 2074991 | 66.2137
 Condition           |  466882 | 14.8984
 Observation         |  217807 |  6.9503
 Procedure           |  165800 |  5.2907
 Measurement         |  127239 |  4.0602
 Device              |   29410 |  0.9385
 Spec Anatomic Site  |   28783 |  0.9185
 Meas Value          |   10415 |  0.3323
 Unit                |    2350 |  0.0750
 Type Concept        |    2170 |  0.0692
 Provider Specialty  |    1957 |  0.0624
 Specimen            |    1767 |  0.0564
 Metadata            |    1689 |  0.0539
 Revenue Code        |     538 |  0.0172
 Place of Service    |     480 |  0.0153
 Race                |     467 |  0.0149
 Relationship        |     242 |  0.0077
 Condition/Obs       |     182 |  0.0058
 Currency            |     180 |  0.0057
 Condition/Meas      |     115 |  0.0037
 Route               |      81 |  0.0026
 Obs/Procedure       |      78 |  0.0025
 Condition/Device    |      52 |  0.0017
 Condition/Procedure |      25 |  0.0008
 Meas/Procedure      |      25 |  0.0008
 Gender              |      19 |  0.0006
 Device/Procedure    |       9 |  0.0003
 Meas Value Operator |       9 |  0.0003
 Visit               |       8 |  0.0003
 Drug/Procedure      |       3 |  0.0001
 Spec Disease Status |       3 |  0.0001
 Ethnicity           |       2 |  0.0001

When I use any other value in the where clause (including 'Condition', with 15% of the rows), Postgres does use the index.
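(Illustrative check; exact costs will vary, but for a more selective value the plan flips to the index, as described above:)

EXPLAIN SELECT * FROM record_counts WHERE domain_id = 'Condition';
-- expected: a bitmap scan on rcdom rather than a seq scan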

(Somewhat surprisingly, even after I cluster the table on the domain_id index, it still doesn't use the index when I filter on 'Drug'; but since the potential gain on a filter that excludes only 34% of the rows is small, it doesn't seem worth pursuing further.)
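(For reference, the clustering step was along these lines; note that CLUSTER takes an exclusive lock and rewrites the table, and the planner only sees the improved physical ordering after a fresh ANALYZE updates the correlation statistic:)

-- Rewrite the table in domain_id order, then refresh planner statistics:
CLUSTER record_counts USING rcdom;
ANALYZE record_counts;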

  • 2
    For a condition that returns 66% of all rows an index lookup will never be faster then a seq scan. Because retrieval through an index requires 3-4 I/O operations **per row** (random access). See e.g. here: http://stackoverflow.com/a/5203827/330315 –  Jan 23 '17 at 12:58
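(Related knob, as a sketch: the seq-vs-index tradeoff the comment describes is priced through seq_page_cost and random_page_cost; on fast SSD storage it is common to lower random_page_cost, which moves the crossover point in favor of index scans:)

SHOW seq_page_cost;      -- default 1.0
SHOW random_page_cost;   -- default 4.0

-- On SSDs a lower value is often appropriate (1.1 is illustrative, not a recommendation):
SET random_page_cost = 1.1;
EXPLAIN SELECT * FROM record_counts WHERE domain_id = 'Drug';
RESET random_page_cost;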