(first of all - apologies for the title but I couldn't come up with a better one)
Here is my problem - I have a table with 4 columns - entity::INT
, entry::TEXT
, state::INT
and day::INT
.
There could be anywhere from 50 to 1,000 entities
. Each entity
can have over 100 million entries
. Each entry
can have one or more states
, which changes if the data stored in the entry has changed but only one state
can be written for any particular day
. The day
starts at one and is incremented each day.
Example:
entity | entry | state | day
-------------------------------------
1 | ABC123 | 1 | 1
1 | ABC124 | 2 | 1
1 | ABC125 | 3 | 1
...
1 | ABC999 | 999 | 1
2 | BCD123 | 1000 | 1
...
1 | ABC123 | 1001 | 2
2 | BCD123 | 1002 | 3
The index is set to (entity, day, state)
.
What I want to achieve is to efficiently select the most current state of each entry on day N.
Currently, every week I write all the entries with their latest state
to the table so to minimize the number of days that we need to scan, however, given the total number of entries
(worst case scenario - 1,000 entities times 100,000,000 entries is a lot of rows to write each week) the table is slowly but surely bloats and everything becomes really slow.
I need to be able to stop writing this "full" version weekly and instead have a setup that will still be fast enough to achieve that. I considered to use DISTINCT ON
with a different index set to (entity, entry, day DESC, state)
so that I could:
SELECT DISTINCT ON (entity, entry) entry, state
FROM table
WHERE entity = <entity> AND day <= <day>
ORDER BY entity, entry, day DESC, state;
Would that be the most efficient way to do it or there are better ways? Or entry
possibly having hundreds of millions of unique values makes it a poor choice for the second column in the index and the performance will eventually come to a halt?