
I am a little bit confused and need some advice. I use a PostgreSQL 11 database. I have this pretty simple SQL statement:

SELECT DISTINCT "CITY", "AREA", "REGION"
    FROM youtube
WHERE
    "CITY" IS NOT NULL
AND
    "AREA" IS NOT NULL
AND
    "REGION" IS NOT NULL

The youtube table which I use in the SQL statement has 25 million records. I think that's why the query takes 15-17 seconds to complete. For the web project where I use that query, that's too long. I'm trying to speed up the request.

I created this index on the youtube table:

CREATE INDEX youtube_location_idx ON public.youtube USING btree ("CITY", "AREA", "REGION");

After this step I ran the query again, but it takes the same time to complete. It seems like the query doesn't use the index. How do I know if any index is used in a query?

EXPLAIN ANALYZE returns: *(query plan posted as an image)*

Erwin Brandstetter
Nurzhan Nogerbek
  • It's not using the index. It's doing a Full Table Scan (Seq Scan) on the table. – The Impaler Dec 07 '18 at 03:49
  • It seems like that. So how can I tell the PostgreSQL 11 database to use the specific index which I created? – Nurzhan Nogerbek Dec 07 '18 at 03:54
  • It might never use that index with the optimizer choosing instead to do a full table scan. NULLs do not get indexed, so NOT NULL isn't necessarily assisted by the index you just built. Read [this prior question/answer](https://stackoverflow.com/questions/3467982/indexing-null-values-in-postgresql) – Paul Maxwell Dec 07 '18 at 04:01
  • What records are you storing in the table, and what's its primary key? If AREA, REGION etc. are independent of the primary key, you should consider having separate tables for AREA, REGION and CODES and keeping only their ids stored in your main table with a foreign key. – Kaushik Nayak Dec 07 '18 at 04:30

3 Answers


You answered the question in the title yourself by running EXPLAIN. The query plan shows which indexes are used and how. For details see the chapter "Using EXPLAIN" in the manual.
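For illustration (these plan snippets are a sketch, not output from your table), the difference shows up in the first line of the plan:

```sql
EXPLAIN SELECT "CITY", "AREA", "REGION" FROM youtube WHERE "CITY" = 'Alperton';

-- A plan that uses the index contains a node like:
--   Index Scan using youtube_location_idx on youtube ...
-- or, if all needed columns come from the index:
--   Index Only Scan using youtube_location_idx on youtube ...
-- Your current plan instead starts with:
--   Seq Scan on youtube ...
```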

As for why the query uses a sequential scan and no indexes: 25 million rows, 992781 rows removed. You are fetching 24,709,900 rows, that's almost all rows.

This is never going to be fast.
This is only going to use an index in special situations.

Using an index typically only makes sense for a small fraction of all rows. Else it would just add additional cost. Depending on a number of co-factors, the Postgres query planner starts to consider a btree index for around 5% of all rows or less.

Well, if your table rows are substantially wider than the three columns in your SELECT list, a (partial) covering index might help somewhat if you get index-only scans out of it. Again, needs to meet some preconditions. And every index also adds storage and maintenance costs.
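A sketch of such a partial index (the name is made up; it covers exactly the three columns in your SELECT list, restricted to the non-null rows the query asks for):

```sql
-- Partial multicolumn index: only rows where all three columns are non-null,
-- so both the WHERE clause and the SELECT list of the query can be
-- satisfied from the index alone (an index-only scan).
CREATE INDEX youtube_loc_notnull_idx ON public.youtube ("CITY", "AREA", "REGION")
WHERE "CITY" IS NOT NULL
  AND "AREA" IS NOT NULL
  AND "REGION" IS NOT NULL;
```

For actual index-only scans the table's visibility map also has to be mostly up to date, i.e. the table needs to be vacuumed regularly.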

Aside: A comment claimed NULL values couldn't be indexed. This is incorrect: NULL values can be indexed. Not as efficiently as other values, but that doesn't make much of a difference. Also irrelevant to the case at hand.
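To demonstrate (a sketch; whether the planner actually picks the index depends on how selective the predicate is): a plain btree index stores NULL values and can serve `IS NULL` predicates:

```sql
-- NULL values are stored in btree indexes, so an IS NULL predicate
-- can be answered with an index scan if few rows qualify:
EXPLAIN SELECT * FROM youtube WHERE "CITY" IS NULL;
```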

Erwin Brandstetter
  • Thank you for this information. What do you think about `partitioning` in my case? – Nurzhan Nogerbek Dec 07 '18 at 04:44
  • @NurzhanNogerbek: Partitioning won't help you. Fast storage, plenty of RAM (for repeated calls) and a fast connection to the server would help. (Transferring data over the wire might be the bottleneck here.) Do you *need* to fetch almost all rows? That is the core question here. – Erwin Brandstetter Dec 07 '18 at 04:46
  • Well, if I set specific values for the columns `CITY`, `AREA`, `REGION`, could partitioning help then? – Nurzhan Nogerbek Dec 07 '18 at 04:55
  • Partitioning won't help this query. Period. Not sure what *"set specific values to columns"* is supposed to mean. Questions are: do you *need* to retrieve 24M rows, and why the `DISTINCT`? It makes the query more expensive, too. – Erwin Brandstetter Dec 07 '18 at 04:59
  • By `set specific values to columns` I mean, let's say we use a query like: `SELECT DISTINCT "CITY", "AREA", "REGION" FROM youtube WHERE "CITY"='Alperton' AND "AREA"='Brent' AND "REGION"='WEMBLEY'`. I thought in that case partitioning could be useful. After your words I am confused. Also I noticed that the query without `DISTINCT` is faster. So I decided to abandon the `DISTINCT` command. – Nurzhan Nogerbek Dec 07 '18 at 07:46
  • Thank you for your detailed answer! Can I ask you one more question? If the query uses an index, what does the `EXPLAIN` command need to say? – Nurzhan Nogerbek Dec 07 '18 at 07:52
  • The query in your comment is *completely* different from the one in the question and probably returns only a few rows (if any). A perfect use case for indexes. Partitioning might also make sense there. `EXPLAIN` output will have the word "index" in one of various phrases. I added a link to details above. – Erwin Brandstetter Dec 07 '18 at 14:49

There are four types of scan that I know of in PostgreSQL.

Sequential scan: does not use an index.

Index scan: searches the index, then the table.

Index-only scan: searches only the index; it doesn't touch the actual table.

Bitmap heap scan: somewhere between an index scan and a sequential scan.

The third row of your result (Seq Scan) shows that the whole table is scanned sequentially. So you are not using the index.
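If you want to check whether the planner *can* use the index at all (a debugging technique for a session, not something to leave on in production), you can temporarily discourage sequential scans:

```sql
-- Session-local: makes sequential scans look very expensive to the planner,
-- so it will use an index if one is usable at all.
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT DISTINCT "CITY", "AREA", "REGION"
FROM youtube
WHERE "CITY" IS NOT NULL AND "AREA" IS NOT NULL AND "REGION" IS NOT NULL;

RESET enable_seqscan;
```

If the plan still shows a Seq Scan, the index cannot serve the query; if it switches to an index scan but is slower, the planner's original choice was right.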

mcan

I think you can use an index on this. Something like:

SELECT "CITY", "AREA", "REGION"
FROM (SELECT DISTINCT ON ("CITY", "AREA", "REGION") "CITY", "AREA", "REGION"
      FROM youtube
      ORDER BY "CITY", "AREA", "REGION"
     ) car
WHERE "CITY" IS NOT NULL AND
      "AREA" IS NOT NULL AND
      "REGION" IS NOT NULL;

This should use an index on ("CITY", "AREA", "REGION") for the SELECT DISTINCT -- which is presumably an expensive operation for this query.

That said, the query is going to return a lot of data, so even using the index may not noticeably improve overall performance.

Gordon Linoff