Is there a better way to reconstruct a data from hundreds of millions of entries spread over a long period of time?

Question

(first of all - apologies for the title but I couldn't come up with a better one)

Here is my problem - I have a table with 4 columns - entity::INT, entry::TEXT, state::INT and day::INT.

There could be anywhere from 50 to 1,000 entities. Each entity can have over 100 million entries. Each entry can have one or more states, which changes if the data stored in the entry has changed but only one state can be written for any particular day. The day starts at one and is incremented each day.

Example:

entity | entry  | state     | day
-------------------------------------
1      | ABC123 | 1         | 1
1      | ABC124 | 2         | 1
1      | ABC125 | 3         | 1
...
1      | ABC999 | 999       | 1
2      | BCD123 | 1000      | 1
...
1      | ABC123 | 1001      | 2
2      | BCD123 | 1002      | 3

The index is set to (entity, day, state).

What I want to achieve is to efficiently select the most current state of each entry on day N.

Currently, every week I write all the entries with their latest state to the table so to minimize the number of days that we need to scan, however, given the total number of entries (worst case scenario - 1,000 entities times 100,000,000 entries is a lot of rows to write each week) the table is slowly but surely bloats and everything becomes really slow.

I need to be able to stop writing this "full" version weekly and instead have a setup that will still be fast enough to achieve that. I considered to use DISTINCT ON with a different index set to (entity, entry, day DESC, state) so that I could:

SELECT DISTINCT ON (entity, entry) entry, state
FROM table
WHERE entity = <entity> AND day <= <day>
ORDER BY entity, entry, day DESC, state;

Would that be the most efficient way to do it or there are better ways? Or entry possibly having hundreds of millions of unique values makes it a poor choice for the second column in the index and the performance will eventually come to a halt?

Is there a possibility to redesign the schema, e.g. split this table, write to additional tables via a trigger, etc? — IMSoP, Aug 03 '20 at 18:00
I know that this does not answer your question, but have you considered using TimescaleDB for this? — Mike Organek, Aug 03 '20 at 18:09
Timescaledb is installed on PostGreSQL and compatible with v11. It essentially automatically creates new partitions in the tables for each time series and manages that for you. — MatBailie, Aug 03 '20 at 18:44
You mention `version` and `state`. Which is it? (Please [edit] the question to fix.) You also mention to `select the most current state of each entry on day N.` But then show a query to do that for *one* entity. You also mention another table and I am not sure what we are working off now. Please clarify the setup and the question. — Erwin Brandstetter, Aug 03 '20 at 19:36
And is there a table "entry" holding *one* row for each existing entry? I.e., each existing (or relevant) distinct combination of `(entity, entry)`? That would help. — Erwin Brandstetter, Aug 03 '20 at 19:50

score 0 · Answer 1 · answered Aug 03 '20 at 18:01

0

You want to rank the entries by time an take the latest one. That's the same as ranking them in reverse time orse and taking the first one. And ROW_NUMBER() is one way to do that.

WITH
    ranked AS
(
    SELECT
        *, 
        ROW_NUMBER()
            OVER (
                PARTITION BY entity, entry
                    ORDER BY day DESC
            )
              AS entity_entry_rank
    FROM
        yourTable
)
SELECT
    *
FROM
     ranked
WHERE
    entity_entry_rank = 1

The day column can then become a timestamp, and you don't need to store a new copy every day.

The appropriate index would be (entity, entry, timestamp)

Also, it's common to have two tables. One with the history, one with the latest value. That makes use of the current value quicker, at a minor disk overhead.

(Apologies for errors or formating, I'm on my phone.)

answered Aug 03 '20 at 18:01

MatBailie

83,401
18
103
137

I should have specified that, but the `day` is there because the entries are collected once per day in bulk so the time doesn't really matter. On a "normal" day, only rows for new entries are added or for the entries whose state has changed. In regards to two tables, the actual data belongs to the entity is already stored in a separate table. ```entry``` here is an external reference while ```state``` is an internal reference which is used to retrieve data from another table – afonja Aug 03 '20 at 18:23
@afonja - I'm not sure what exactly you're asking then. I think you should update your question to be much more specific and clear. Either way, the pattern above is usually the best one for finding "the latest row", and/or having one table for the latest data and a separate one for the history (managed by an ETL pipeline and/or triggers). To cover all the options would fill a book, SO is really only suited to Specific questions rather than wide-ranging discussions. – MatBailie Aug 03 '20 at 18:47

Erwin Brandstetter · Answer 2 · 2020-08-03T21:11:53.030

DISTINCT ON is simple, and performance is great - for few rows per entry. See:

Select first row in each GROUP BY group?

Not for many rows per entry, though.

Each entity can have over 100 million entries

See:

Optimize GROUP BY query to retrieve latest row per user

Assuming an entry table that holds one row for each existing entry (each relevant distinct combination of (entity, entry)), this query is very efficient to get the latest state for a given day:

SELECT e.entity, e.entry, t.day, t.state
FROM   entry e
LEFT   JOIN LATERAL (
   SELECT day, state
   FROM   tbl
   WHERE  (entity, entry) = (e.entity, e.entry)
   AND    day <= <day>  -- given day
   ORDER  BY day DESC
   LIMIT  1
   ) t ON true;
ORDER  BY e.entity, e.entry; -- optional

Use CROSS JOIN LATERAL instead of the LEFT JOIN if you only want entries that have at least one row in tbl.

The perfect index for this is on (entity, entry, day) INCLUDE (state).

If you have no table entry, consider creating one. (Typically, there should be one.) The rCTE techniques outlined in the linked answer above can also be used to create such a table.

Is there a better way to reconstruct a data from hundreds of millions of entries spread over a long period of time?

2 Answers2