I am storing versions of a document in PostgreSQL 9.4. Every time the user creates a new version, a new row is inserted so that I can track all changes over time. All versions of the same document share a reference_id, some rows get approved while others remain drafts, and each row also has a viewable_at time.
id | reference_id | approved | viewable_at         | created_at | content
1  | 1            | true     | 2015-07-15 00:00:00 | 2015-07-13 | Hello
2  | 1            | true     | 2015-07-15 11:00:00 | 2015-07-14 | Guten Tag
3  | 1            | false    | 2015-07-15 17:00:00 | 2015-07-15 | Grüß Gott
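For reference, the table definition is roughly the following (the exact types and constraints here are approximations, not the literal schema):

CREATE TABLE documents (
    id           serial PRIMARY KEY,
    reference_id integer   NOT NULL,
    approved     boolean   NOT NULL DEFAULT false,
    viewable_at  timestamp NOT NULL,
    created_at   timestamp NOT NULL DEFAULT now(),
    content      text
);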
The most frequent query fetches, for each reference_id, the most recent row where approved is true and viewable_at is earlier than the current time. (In this case, row id 2 would be included in the results.)
So far, this is the best query I've come up with that doesn't require me to add additional columns:
SELECT DISTINCT ON (reference_id) reference_id, id, approved, viewable_at, content
FROM documents
WHERE approved = true AND viewable_at <= '2015-07-15 13:00:00'
ORDER BY reference_id, created_at DESC
I have an index on reference_id and a multi-column index on approved and viewable_at.
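The indexes were created roughly like this (the index names here are just placeholders):

CREATE INDEX documents_reference_id_idx ON documents (reference_id);
CREATE INDEX documents_approved_viewable_at_idx ON documents (approved, viewable_at);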
At only 15,000 rows the query is still averaging a few hundred milliseconds (140-200 ms) on my local machine. I suspect that the DISTINCT ON or the sort may be slowing it down.
What is the most efficient way to store this information so that SELECT queries are the most performant?
Result of EXPLAIN (BUFFERS, ANALYZE):
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=6668.86..6730.36 rows=144 width=541) (actual time=89.862..99.613 rows=145 loops=1)
   Buffers: shared hit=2651, temp read=938 written=938
   ->  Sort  (cost=6668.86..6699.61 rows=12300 width=541) (actual time=89.861..97.796 rows=13184 loops=1)
         Sort Key: reference_id, created_at
         Sort Method: external merge  Disk: 7488kB
         Buffers: shared hit=2651, temp read=938 written=938
         ->  Seq Scan on documents  (cost=0.00..2847.80 rows=12300 width=541) (actual time=0.049..40.579 rows=13184 loops=1)
               Filter: (approved AND (viewable_at < '2015-07-20 06:46:55.222798'::timestamp without time zone))
               Rows Removed by Filter: 2560
               Buffers: shared hit=2651
 Planning time: 0.218 ms
 Execution time: 178.583 ms
(12 rows)
Document Usage Notes:
The documents are manually edited and we're not yet autosaving them every X seconds or anything, so the volume will be reasonably low. At this point there is an average of 7 versions per reference_id, of which an average of only 2 (~30%) are approved.
On the min and max side, the vast majority of documents will have 1 or 2 versions and it seems unlikely that any document would have more than 30 or 40. There is a garbage collection process to clean out unapproved versions older than a week, so the total number of versions should stay pretty low.
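The cleanup is essentially something like this (a sketch; the real job may differ in details):

DELETE FROM documents
WHERE approved = false
  AND created_at < now() - interval '1 week';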
For retrieving and practical usage, I could use LIMIT / OFFSET on the queries, but in my tests that doesn't make a huge difference. Ideally this would be a base query that populates a view (something like the sketch below) so that I can run additional queries on top of these results, but I'm not entirely sure how that would affect the resulting performance and am open to suggestions. My impression is that if I can make this storage / query as simple and fast as possible, then all other queries that start from this point will improve, but it's likely that I'm wrong and that each query needs more independent thought.
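What I'm imagining is wrapping the base query above in a view, with the literal timestamp replaced by now() (just a sketch; the view name is a placeholder):

CREATE VIEW current_approved_documents AS
SELECT DISTINCT ON (reference_id) reference_id, id, approved, viewable_at, content
FROM documents
WHERE approved = true AND viewable_at <= now()
ORDER BY reference_id, created_at DESC;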