I am storing versions of a document in PostgreSQL 9.4. Every time the user creates a new version, a new row is inserted so that I can track all changes over time. All versions of the same document share a reference_id, some rows get approved while others remain drafts, and each row also has a viewable_at time.
id | reference_id | approved | viewable_at         | created_at | content
1  | 1            | true     | 2015-07-15 00:00:00 | 2015-07-13 | Hello
2  | 1            | true     | 2015-07-15 11:00:00 | 2015-07-14 | Guten Tag
3  | 1            | false    | 2015-07-15 17:00:00 | 2015-07-15 | Grüß Gott
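For reference, the table definition is roughly the following (the exact types and constraints here are approximations, not the literal schema):

CREATE TABLE documents (
    id           serial PRIMARY KEY,
    reference_id integer   NOT NULL,
    approved     boolean   NOT NULL DEFAULT false,
    viewable_at  timestamp NOT NULL,
    created_at   timestamp NOT NULL DEFAULT now(),
    content      text
);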
The most frequent query fetches, for each reference_id, the most recent row where approved is true and viewable_at is earlier than the current time. (In this case, row id 2 would be included in the results.)
So far, this is the best query I've come up with that doesn't require me to add additional columns:
SELECT DISTINCT ON (reference_id) reference_id, id, approved, viewable_at, content
FROM documents
WHERE approved = true AND viewable_at <= '2015-07-15 13:00:00'
ORDER BY reference_id, created_at DESC
I have an index on reference_id and a multi-column index on approved and viewable_at.
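The indexes were created roughly like this (the index names here are just placeholders):

CREATE INDEX documents_reference_id_idx ON documents (reference_id);
CREATE INDEX documents_approved_viewable_at_idx ON documents (approved, viewable_at);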
At only 15,000 rows the query is still averaging a few hundred milliseconds (140-200 ms) on my local machine. I suspect that the DISTINCT ON or the sort may be slowing it down.
What is the most efficient way to store this information so that SELECT queries are the most performant?
Result of EXPLAIN (BUFFERS, ANALYZE):
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=6668.86..6730.36 rows=144 width=541) (actual time=89.862..99.613 rows=145 loops=1)
   Buffers: shared hit=2651, temp read=938 written=938
   ->  Sort  (cost=6668.86..6699.61 rows=12300 width=541) (actual time=89.861..97.796 rows=13184 loops=1)
         Sort Key: reference_id, created_at
         Sort Method: external merge  Disk: 7488kB
         Buffers: shared hit=2651, temp read=938 written=938
         ->  Seq Scan on documents  (cost=0.00..2847.80 rows=12300 width=541) (actual time=0.049..40.579 rows=13184 loops=1)
               Filter: (approved AND (viewable_at < '2015-07-20 06:46:55.222798'::timestamp without time zone))
               Rows Removed by Filter: 2560
               Buffers: shared hit=2651
 Planning time: 0.218 ms
 Execution time: 178.583 ms
(12 rows)
Document Usage Notes:
The documents are manually edited and we're not yet autosaving them every X seconds or anything, so the volume will be reasonably low. At this point there is an average of 7 versions per reference_id, of which an average of only 2 (~30%) are approved.
On the min and max side, the vast majority of documents will have 1 or 2 versions and it seems unlikely that any document would have more than 30 or 40. There is a garbage collection process to clean out unapproved versions older than a week, so the total number of versions should stay pretty low.
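The cleanup is essentially something like this (a sketch; the real job may differ in details):

DELETE FROM documents
WHERE approved = false
  AND created_at < now() - interval '1 week';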
For retrieving and practical usage, I could use LIMIT / OFFSET on the queries, but in my tests that doesn't make a huge difference. Ideally this would be a base query that populates a view (something like the sketch below) so that I can run additional queries on top of these results, but I'm not entirely sure how that would affect the resulting performance and am open to suggestions. My impression is that if I can make this storage / query as simple and fast as possible, then all other queries that start from this point will improve, but it's likely that I'm wrong and that each query needs more independent thought.
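What I'm imagining is wrapping the base query above in a view, with the literal timestamp replaced by now() (just a sketch; the view name is a placeholder):

CREATE VIEW current_approved_documents AS
SELECT DISTINCT ON (reference_id) reference_id, id, approved, viewable_at, content
FROM documents
WHERE approved = true AND viewable_at <= now()
ORDER BY reference_id, created_at DESC;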