
I've got a table pings with about 15 million rows in it. I'm on Postgres 9.2.4. The relevant columns are a foreign key monitor_id, a created_at timestamp, and a response_time integer representing milliseconds. Here is the exact structure:

     Column      |            Type             |                     Modifiers                      
-----------------+-----------------------------+----------------------------------------------------
 id              | integer                     | not null default nextval('pings_id_seq'::regclass)
 url             | character varying(255)      | 
 monitor_id      | integer                     | 
 response_status | integer                     | 
 response_time   | integer                     | 
 created_at      | timestamp without time zone | 
 updated_at      | timestamp without time zone | 
 response_body   | text                        | 
Indexes:
    "pings_pkey" PRIMARY KEY, btree (id)
    "index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
    "index_pings_on_monitor_id" btree (monitor_id)

I want to query for all the response times that are not NULL (90% won't be NULL, about 10% will be NULL), that have a specific monitor_id, and that were created in the last month. I'm doing the query with ActiveRecord, but the end result looks something like this:

SELECT "pings"."response_time"
FROM "pings"
WHERE "pings"."monitor_id" = 3
AND (created_at > '2014-03-03 20:23:07.254281'
AND response_time IS NOT NULL)

It's a pretty basic query, but it takes about 2000ms to run, which seems rather slow. I'm assuming an index would make it faster, but none of the indexes I've tried have helped, which presumably means I'm not indexing properly.

When I run EXPLAIN ANALYZE, this is what I get:

Bitmap Heap Scan on pings  (cost=6643.25..183652.31 rows=83343 width=4) (actual time=58.997..1736.179 rows=42063 loops=1)
  Recheck Cond: (monitor_id = 3)
  Rows Removed by Index Recheck: 11643313
  Filter: ((response_time IS NOT NULL) AND (created_at > '2014-03-03 20:23:07.254281'::timestamp without time zone))
  Rows Removed by Filter: 324834
  ->  Bitmap Index Scan on index_pings_on_monitor_id  (cost=0.00..6622.41 rows=358471 width=0) (actual time=57.935..57.935 rows=366897 loops=1)
        Index Cond: (monitor_id = 3)

So there is an index on monitor_id that is being used towards the end, but nothing else. I've tried various permutations and orders of compound indexes using monitor_id, created_at, and response_time. I've tried ordering the index by created_at in descending order. I've tried a partial index with response_time IS NOT NULL.
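For example, a couple of the variations I tried looked something like this (the names and exact column orders here are just illustrative):

CREATE INDEX index_pings_on_created_at_and_monitor_id_and_response_time
ON pings (created_at DESC, monitor_id, response_time);

CREATE INDEX index_pings_on_created_at_and_monitor_id_partial
ON pings (created_at DESC, monitor_id)
WHERE response_time IS NOT NULL;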

Nothing I've tried makes the query any faster. How would you optimize and/or index it?

  • Look into making use of a partial index using where response_time IS NOT NULL. If it's only 10% of the data matching that criteria the index will be much smaller and quicker to search. Partial indexes can be magical at times :). http://www.postgresql.org/docs/9.3/static/indexes-partial.html – Kuberchaun Apr 03 '14 at 20:44
  • Right, I did look into that; however, I don't think a partial index will work (at least not solely), because it's the opposite percentage, although that might not have been clear the way I phrased it. 90% of response times won't be NULL, so a partial index would only exclude about 10% of the rows. – Brandon Weiss Apr 03 '14 at 20:57
  • You should add your version of Postgres and the (relevant part of the) table definition - what you get with `\d tbl` from `psql`, so we see exact data types, constraints, indexes, ... And specify more closely how the conditions on `created_at` and `monitor_id` can change. How is `monitor_id` distributed? A partial index is still an option, even if it won't buy you much in this case ... – Erwin Brandstetter Apr 03 '14 at 21:14
  • Thanks for editing it to be clearer! I just updated with the exact table structure and my Postgres version (9.2.4). Ah, sorry, the conditions under which `created_at` and `monitor_id` can change might not have been clear from my example query. I need to be able to run this query against every monitor, and there are/will be a lot of monitors. The `created_at` will always be the same in that it's for a month in the past, but that's a relative time, which means it's always changing. – Brandon Weiss Apr 03 '14 at 21:23
  • @BrandonWeiss: That means, most of the rows (older than a month) are irrelevant for your queries? Also, all you select from the table is `response_time`. Is that correct? If not, please update with actual requirements. – Erwin Brandstetter Apr 03 '14 at 21:35
  • @ErwinBrandstetter yep, that's correct. For each monitor I'm only selecting about 40k rows or so, and the only column I care about is `response_time`. – Brandon Weiss Apr 03 '14 at 21:52

1 Answer


Sequence of columns

Create a partial multicolumn index with the right sequence of columns. You have one:

"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)

But the sequence of columns is not serving you well. Reverse it:

CREATE INDEX idx_pings_monitor_created ON pings (monitor_id, created_at DESC)
WHERE response_time IS NOT NULL;

The rule of thumb here is: equality first, ranges later. More about that:
Multicolumn index and performance

As discussed, the condition WHERE response_time IS NOT NULL does not buy you much. If you have other queries that could utilize this index including rows with NULL values in response_time, drop the condition. Else, keep it.

You can probably also drop both other existing indexes. More about the sequence of columns in btree indexes:
Working of indexes in PostgreSQL
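If it turns out no other query relies on them (check first), dropping them is simply:

DROP INDEX index_pings_on_created_at_and_monitor_id;
DROP INDEX index_pings_on_monitor_id;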

Covering index

If all you need from the table is response_time, this can be much faster yet - if you don't have lots of write operations on the rows of your table. Include the column in the index at the last position to allow index-only scans (making it a "covering index"):

CREATE INDEX idx_pings_monitor_created
ON     pings (monitor_id, created_at DESC, response_time)
WHERE  response_time IS NOT NULL;  -- maybe

Or you can take it one step further:

More radical partial index

Create a tiny helper function. Effectively a "global constant" in your db:

CREATE OR REPLACE FUNCTION f_ping_event_horizon()
  RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-03-03 0:0'::timestamp$$;  -- One month in the past

Use it as condition in your index:

CREATE INDEX idx_pings_monitor_created_response_time
ON     pings (monitor_id, created_at DESC, response_time)
WHERE  response_time IS NOT NULL  -- maybe
AND   created_at > f_ping_event_horizon();

And your query looks like this now:

SELECT response_time
FROM   pings
WHERE  monitor_id = 3
AND    response_time IS NOT NULL
AND    created_at > '2014-03-03 20:23:07.254281'
AND    created_at > f_ping_event_horizon();

Aside: I trimmed some noise.

The last condition seems logically redundant. Only include it if Postgres does not understand it can use the index without it; it might be necessary. The actual timestamp in the condition must be later than the one in the function. But that's obviously the case according to your comments.

This way we cut all the irrelevant rows and make the index much smaller. The effect degrades slowly over time. Refit the event horizon and recreate the indexes from time to time to get rid of the added weight. You could do that with a weekly cron job, for example.

When updating (recreating) the function, you need to recreate all indexes that use the function in any way. Best in the same transaction. Because the IMMUTABLE declaration for the helper function is a bit of a false promise. But Postgres only accepts immutable functions in index definitions. So we have to lie about it. More about that:
Does PostgreSQL support "accent insensitive" collations?
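A sketch of such a refresh, using the names from above and a hypothetical new cutoff, all in one transaction:

BEGIN;

CREATE OR REPLACE FUNCTION f_ping_event_horizon()
  RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-04-01 0:0'::timestamp$$;  -- cutoff moved forward to ~ 1 month in the past

DROP INDEX IF EXISTS idx_pings_monitor_created_response_time;

CREATE INDEX idx_pings_monitor_created_response_time
ON     pings (monitor_id, created_at DESC, response_time)
WHERE  response_time IS NOT NULL
AND    created_at > f_ping_event_horizon();

COMMIT;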

Why the function at all? This way, all the queries using the index can remain unchanged.

With all of these changes the query should be faster by orders of magnitude now. A single, continuous index-only scan is all that's needed. Can you confirm that?
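To check, you could run the query with EXPLAIN (ANALYZE, BUFFERS) and look for an Index Only Scan on the new index. (The visibility map has to be reasonably up to date for index-only scans, so a recent VACUUM on the table helps.)

EXPLAIN (ANALYZE, BUFFERS)
SELECT response_time
FROM   pings
WHERE  monitor_id = 3
AND    response_time IS NOT NULL
AND    created_at > '2014-03-03 20:23:07.254281'
AND    created_at > f_ping_event_horizon();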

  • Dammit. I could have sworn that was one of the permutations I tried, but I guess I must have missed one. Regarding the existing index in the opposite order, that's actually there to make a different query performant. – Brandon Weiss Apr 03 '14 at 21:57
  • OK, so I tried that and it's _way_ better. Now it's about 500ms. You mention that it can be improved even further? How so? – Brandon Weiss Apr 03 '14 at 21:58
  • @BrandonWeiss: Quite a bit further, yes. – Erwin Brandstetter Apr 03 '14 at 22:03
  • Interesting. Regarding write operations, we do a few hundred to the `pings` table per minute. That's probably not a lot by SQL standards. I'm a little confused by the event horizon function. I had considered trying something like this, like using a relative timestamp (`DATE(NOW()) - 30`) or something like that in the partial index, but I wasn't sure if it would understand what I was trying to do and update properly. In this instance, the timestamp is hardcoded in the function, right? Will that work? The query always has to be a month ago relative to the current time. – Brandon Weiss Apr 03 '14 at 22:16
  • Btw, I added `response_time` in the last position (without the event horizon function) and the speed improved. It doesn't look like it's doing an index-only scan, but it dropped from 500ms to 300ms, which is awesome. – Brandon Weiss Apr 03 '14 at 22:25
  • @BrandonWeiss: I added some more explanation. Should clarify the handling of the event horizon. Concerning the index-only scan: this feature can also be used for *parts* of the query. Read the Wiki page I linked to. Full explanation there. – Erwin Brandstetter Apr 03 '14 at 22:31
  • If you can afford it, run `VACUUM FULL ANALYZE tbl` (takes an exclusive lock and will take some time for 15M rows + indexes). Or better yet [`CLUSTER pings USING idx_pings_monitor_created_response_time`](http://stackoverflow.com/questions/13998139/optimize-postgres-timestamp-query-range/14007963#14007963). Then try your query again. The effect will deteriorate over time. – Erwin Brandstetter Apr 03 '14 at 22:36
  • Ah! The index just needs to be periodically updated! That's brilliant. You have been so awesomely helpful—I can't thank you enough :) – Brandon Weiss Apr 03 '14 at 22:38
  • @BrandonWeiss: Yeah, I think we achieved quite a bit here. ;) – Erwin Brandstetter Apr 03 '14 at 22:38
  • Just to follow up, it looks like postgres is in fact smart enough to know that it can use that index without including the function call as a redundant where clause. It uses the correct index regardless of if it's in the query or not, which is nice. However, I couldn't get postgres to do an index-only scan. It still does a Bitmap Heap Scan and then a Bitmap Index Scan, regardless of if the index contains the event horizon function. Any idea why that might be? – Brandon Weiss Apr 04 '14 at 02:26
  • @BrandonWeiss: Yeah. "Visibility map". It's all in [Postgres Wiki page](https://wiki.postgresql.org/wiki/Index-only_scans) I linked to. – Erwin Brandstetter Apr 04 '14 at 02:42