3

I have a document management system that records all historical events in a history table. I've been asked to be able to provide the oldest doc_id that has a status of 5 for a given client on a given date. The table looks something like this (truncated for simplicity):

doc_history:
    id integer
    doc_id integer
    event_date timestamp
    client_id integer
    status_id integer

The client_id and status_id columns are the value of the document after the event has occurred. This means that the maximum history event row for a document defined by doc_id will match the same columns in the document table. Limiting events by a specific event date, you can see what the values of the document were at that time. Because these values are not static, I can't just simply search for a specific client_id with a status_id of 5 because the found result might not match the max(id) of the document. Hopefully that makes some sense.

What I've found to work, but is slow, is the following:

select
    t.*
from
    (select
        distinct on (doc_id),
        *
    from
        doc_history
    where
        event_date <= '2013-02-17 23:59:59'
    order by
        doc_id, id desc) t
where
    t.client_id = 9999 and
    t.status_id = 5
limit 1;

Basically, I'm getting the maximum ID for a particular document ID before the given maximum event date, and then verifying that that maximum history item is assigned to the given client, with the status set to 5.

The drawback to doing it my way is that I'm scanning all history records for all clients to get their maximums, and then finding what I'm looking for for one client and status. As of right now, this scans roughly 15.06 million rows, and takes about 90 seconds on my dev server (which is not blazing fast).

To make matters more complicated I need to do this for each day of the previous week, or seven times total each run. Additionally, all documents in the system start with a status of 5, which represents new. This makes it so that this query would just return the first document entered for that client:

select * from doc_history where client_id = 9999 and
    status_id = 5 and
    event_date <= '2013-02-17 23:59:59'
    order by id limit 1;

What I'm hoping to do is scan until I find a maximum history record for a specific document that matches the specific client and status values without having to first find the maximum ids for all document ids for all clients first. I don't know if this can be done with a windowing function (partition by) or some other logic that I'm currently not seeing.

An example of one of the events in the doc_history table:

# select id, doc_id, event, old_value, new_value, event_date, client_id, status_id from doc_history where doc_id = 9999999 order by id;
    id    | doc_id  | event | old_value | new_value |         event_date         | client_id | status_id
----------+---------+-------+-----------+-----------+----------------------------+-----------+-----------
 25362415 | 9999999 |    13 |           |           | 2013-02-14 11:49:50.032824 |      9999 |         5
 25428192 | 9999999 |    15 |           |           | 2013-02-18 11:15:48.272542 |      9999 |         5
 25428193 | 9999999 |     7 | 5         | 1         | 2013-02-18 11:15:48.301377 |      9999 |         1

Event 7 is status changed, and the old and new values show that it was changed from 5 to 1, which is reflected in the status_id column. For an event_date of less than or equal to 2013-02-17 23:59:59, the above record would have been the oldest "NEW" document with status_id of 5, but after 2/17/2013 it would not have.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Adam
  • 31
  • 4

3 Answers3

3

This should be much faster:

SELECT *
FROM   doc_history h1
WHERE  event_date < '2013-02-18 0:0'::timestamp
AND    client_id = 9999
AND    status_id = 5
AND NOT EXISTS (
   SELECT 1
   FROM   doc_history h2
   WHERE  h2.doc_id = h1.doc_id
   AND    h2.event_date < '2013-02-18 0:0'::timestamp
   AND    h2.event_date > h1.event_date  -- use event_date instead of id!
   )
ORDER  BY doc_id
LIMIT  1;

I had a very hard time making sense of your description. Basically, as I understand it now, you want the row with the biggest doc_id for a given (client_id, status_id) with event_date prior to a given timestamp, where no other row with a higher id (equals later event_date) for the same doc_id exists.

Note how I replaced the condition in your example:

WHERE  event_date <= '2013-02-17 23:59:59'

with:

WHERE  event_date < '2013-02-18 0:0'

Since you have fractional seconds, your expression would fail for timestamps like:
'2013-02-17 23:59:59.123'

I use h2.event_date > h1.event_date instead of h2.id > h1.id in the NOT EXISTS semi-join because I consider it unwise to assume bigger id equals later event_date. You should probably rely on event_date alone.

To make this fast, you need an multicolumn index of the form (updated):

CREATE INDEX doc_history_multi_idx
ON doc_history (client_id, status_id, doc_id, event_date DESC);

I switched positions of doc_id, event_date DESC after your feedback, this should better accommodate ORDER BY doc_id LIMIT 1.

If the condition status_id = 5 is constant (you always check for 5), a partial index instead should be faster, yet:

CREATE INDEX doc_history_multi_idx
ON doc_history (client_id, doc_id, event_date DESC)
WHERE status_id = 5;

And:

CREATE INDEX doc_history_id_idx ON doc_history (doc_id, event_date DESC);
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • My apologies if I didn't come across clearly. Maybe my current head-cold has something to do with it. I like your point about the fractional seconds, so I will work on incorporating that (though it is real edge case for this system). Anyway, I really like your solution here, and I will be testing it tomorrow. Because the table only sees inserts, and never updates/deletes, the ID column is safe and can be interchanged with the event_date, but your concern is justified. One question about your timestamps: is there a reason for the 0:0 versus leaving it off? – Adam Feb 21 '13 at 06:05
  • @Adam: **`date` vs. `timestamp`**: Well, `'2013-02-18'::timestamp` works, too, but it looks like a `date`, while `'2013-02-18 0:0'::timestamp` makes clear it's a timestamp, even if you don't add the cast explicitly. – Erwin Brandstetter Feb 21 '13 at 11:55
  • @Adam: **`id` vs. `event_date`**: *You* know they are in sync, but the system doesn't. It has bearing on indexes among other things. As you can see I used `event_date` for both indexes, so you need to operate with `event_date` in the queries to use them. Or switch everything to `id`. *Mixing both*, like you did, has no benefit, but a number of potential drawbacks. If `id` is guaranteed to deliver, it would be slightly faster. 4 bytes for the `integer` vs. 8 bytes for the `timestamp`, which happens to fall within [MAXALIGN](http://stackoverflow.com/a/7431468/939860) for both indexes. – Erwin Brandstetter Feb 21 '13 at 12:02
  • I've been busy with all this today, and I can say that your query here does work, and is faster. However, the multi-index that you suggested here is not used in my 9.2 server. It's instead opting to use my simple doc_id index. The doc_id, event_date desc index is being used, however. I tried removing all my indexes and using only yours for testing just to be sure. On dev, with a full installation of the table, it's taking about 60 seconds to run, opposed to 90+ seconds on mine. I also agree with you on the event_date vs id, and have shifted my code accordingly. – Adam Feb 21 '13 at 22:43
  • @Adam: I updated the part on indexes after your feedback and further consideration. – Erwin Brandstetter Feb 22 '13 at 07:21
1

provide the oldest doc_id that has a status of 5 for a given client on a given date

This will do it:

select
    min(doc_id) doc_id
from
    doc_history
where
    client_id = 9999
    and status_id = 5
    and date event_date = '2013-02-17'

I have read your question more than once and can't get what you are talking about.

Clodoaldo Neto
  • 118,695
  • 26
  • 233
  • 260
  • What I'm going for is this: what is the oldest unprocessed (NEW) document for client X on 2/17/2013. If I get the oldest event from the history table, I'll get a new document, but I need the newest event before 2/17/2013 with a status of 5 (new) for client X. The status and client values are not static between events because they reflect the changes caused by the events (such as the status and client changing). – Adam Feb 20 '13 at 23:06
  • I've added an example of what the rows in the table actually look like. – Adam Feb 20 '13 at 23:19
0

If I got that right, a n equivalent, and probably fast, query of yours would be:

select t.*
from doc_history
where event_date <= '2013-02-17 23:59:59' and
    t.client_id = 9999 and
    t.status_id = 5
order by doc_id, id desc
limit 1;
MatheusOl
  • 10,870
  • 3
  • 30
  • 28
  • Unfortunately it doesn't. It's a good attempt, but there is no guarantee that the events found by this query would be the last events recorded by the event_date maximum. For example, there could be an event after one returned here where the status_id or the client_id are different. In the records of the example document above, it had a status_id of 5 on 2/14 until 2/18. If you did your query with event_date <= 2013-02-18 23:59:59 you would end up with row id 25428192, but that's not valid because as of 2/18 it had a status_id of 1, not 5 as represented by row 25428193. – Adam Feb 21 '13 at 00:41