
How do I reliably get the contents of a table, and then changes to it, without gaps or overlap? I'm trying to end up with a consistent view of the table over time.

I can first query the database, and then subscribe to a change feed, but there might be a gap where a modification happened between those queries.

Or I can first subscribe to the changes, and then query the table, but then a modification might happen in the change feed that's already processed in the query.

Example of this case:

A subscribe 'messages'
B add 'messages' 'message'
A <- changed 'messages' 'message'
A run get 'messages'
A <- messages

Here A received a 'changed' message before it sent its messages query, and the result of the messages query includes the changed message. Possibly A could simply ignore any changed messages received before the query result arrives. Is it guaranteed that changes received after a query (on the same connection) were not already applied in that query's result, i.e. are the query and the changefeed handled on the same thread?
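The "ignore changes received before the query result" idea can be sketched in plain JavaScript. This is an illustrative simulation, not RethinkDB driver code: `mergeInitialAndChanges`, `initialRows`, and `bufferedChanges` are hypothetical names standing in for the query result and the changefeed events buffered while the query ran.

```javascript
// Sketch: subscribe first, buffer changefeed events, run the query,
// then discard buffered changes already reflected in the query result.
function mergeInitialAndChanges(initialRows, bufferedChanges) {
  // Index the initial query result by primary key.
  const state = new Map(initialRows.map(row => [row.id, row]));
  for (const change of bufferedChanges) {
    const current = state.get(change.new_val.id);
    // If the change's new value already matches the queried row, the
    // write happened before the read -- skip it to avoid reprocessing.
    if (current && JSON.stringify(current) === JSON.stringify(change.new_val)) {
      continue;
    }
    state.set(change.new_val.id, change.new_val);
  }
  return [...state.values()];
}
```

Note that comparing whole values only detects exact duplicates; it cannot distinguish a pre-read write from a later write that happens to produce the same value, which is why the answers below lean on timestamps instead.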

What's the recommended way? I couldn't find any docs on this use case.

Tinco
    Could you post an example of "but then a modification might happen in the change feed that's already processed in the query."? IMO, it would make more sense to subscribe first to a query and then make changes, but I'm not sure I'm seeing the problem you're suggesting. – Jorge Silva Mar 02 '15 at 17:37
  • Thanks, as I typed an example I thought of a way that might work. – Tinco Mar 03 '15 at 14:47

2 Answers


I know you said you came up with an answer but I've been doing this quite a bit and here is what I've been doing:

r.db('test').table('my_table').between(tsOne, tsTwo, {index: 'timestamp'});

So in my jobs, I run an indexed between query that captures data between the last run time and the current moment. You can take a lock on a config table that tracks last_run_time for your jobs, which lets you scale to multiple processors. Because the query uses between, the next job waiting on the lock will only grab data written after the first processor ran. Hope that helps!
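The windowing scheme described above can be sketched as follows. This is a simulation of the bookkeeping only: `claimWindow` and the in-memory `config` object are illustrative stand-ins for the locked config table, and the real between query would use the returned bounds on the 'timestamp' index.

```javascript
// Sketch: each job run claims a [lastRun, now) window under the lock,
// so consecutive runs tile the timeline with no gaps or overlap.
function claimWindow(config, now) {
  const window = { from: config.last_run_time, to: now };
  // In the real setup this update is persisted to the config table
  // while the lock is held, before the between query runs.
  config.last_run_time = now;
  return window;
}
```

Because each window starts exactly where the previous one ended, a row with a given timestamp is picked up by exactly one job run.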

haggy
  • Thanks, using a timestamp is also a good hint. Someone from RethinkDB gave a more complete answer that I've pasted as a separate answer. – Tinco Mar 04 '15 at 11:35

Michael Lucy of RethinkDB wrote:

For .get.changes and .order_by.limit.changes you should be fine because we already send the initial value of the query for those. For other queries, the only way to do that right now is to subscribe to changes on the query, execute the query, and then read from the changefeed and discard any changes from before the read (how to do this depends on what read you're executing and what legal changes to it are, but the easiest way to hack it would probably be to add a timestamp field to your objects that you increment whenever you do an update).
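The timestamp hack from the quote above could look roughly like this. A minimal sketch, assuming every update bumps a `timestamp` field on the document; `isStaleChange` and `readTimestamps` (a map of id to the timestamp observed in the initial read) are hypothetical names, not part of any RethinkDB API.

```javascript
// Sketch: drop changefeed events whose timestamp is at or below the
// timestamp seen for that document in the initial query result.
function isStaleChange(change, readTimestamps) {
  const seen = readTimestamps.get(change.new_val.id);
  return seen !== undefined && change.new_val.timestamp <= seen;
}
```

Documents that were not in the initial read have no entry in the map, so their changes are never discarded.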

In 2.1 we're planning to add an optional argument return_initial that will do what I just described automatically and without any need to change your document schema.

Tinco