We have a scenario like this:
- Millions of records (Record 1, Record 2, Record 3...)
- Partitioned into millions of small non-intersecting groups (Group A, Group B, Group C...)
- Membership gradually changes over time, i.e. a record may be reassigned to another group.
We are redesigning the data schema, and one use case we need to support is given a particular record, find all other records that belonged to the same group at a given point in time. Alternatively, this can be thought of as two separate queries, e.g.:
- To which group did Record 15544 belong, three years ago? (Call this Group g).
- What records belonged to Group g, three years ago?
Supposing we use a relational database, the association between records and groups is easily modelled using a two-column table of record id and group id. A common approach for allowing historical queries is to add a timestamp column. This allows us to answer the question above as follows:
- Find the row for Record 15544 with the most recent timestamp prior to the given date. This tells us Group g.
- Find all records that have at any time belonged to Group g.
- For each of these records, find the row with the most recent timestamp prior to the given date. If this indicates that the record was in Group g at that time, then add it to the result set.
This is not too bad (assuming the table is separately indexed by both record id and group id), and may even be the optimal algorithm for the naive table structure just described, but it does cost an index lookup for every record found in step 2. Is there an alternative data structure that would answer the query more efficiently?
ETA: This is only one of several use cases for the system, so we don't want to speed up this query at the expense of making queries about current groupings slower, nor do we want to pay a huge price in space consumption, etc.