I am thinking of implementing a MongoDB change stream reader and I want to make sure I'm doing it correctly. There are plenty of simple examples out there of how to implement the actual reader code, including in the official documentation, so I'm not too worried about that aspect of it.
I am, however, a little worried about the reader falling behind the change stream and never being able to catch up, and I want to make sure the reader can handle the flow.
The MongoDB server is a cluster, and let's assume for the sake of argument that it is quite busy at all times of day. The change stream API appears to support only a single consumer, since you must iterate the stream's cursor rather than consume it like a queue. I am therefore worried that the single instance iterating the results could take longer to do its work than it takes for new items to be pushed into the stream.
My instinct is to have the reader simply read the stream, batch the changes, and push them into a queue where other workers can scale horizontally to do the actual work. However, I still have a single instance as the reader, and it's still theoretically possible for it to fall behind the stream, even while doing only the bare minimum work of pushing changes into a queue.
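To make the batch-and-forward idea concrete, here is a rough sketch of the reader I have in mind. The batching helper is generic; in the real reader the iterable would be the cursor from pymongo's `collection.watch()`, and `publish` and `save_resume_token` are hypothetical placeholders for the queue producer and a checkpoint store.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(stream: Iterable[T], max_size: int) -> Iterator[List[T]]:
    """Group items from `stream` into lists of up to `max_size` items.

    In the real reader, `stream` would be the change stream cursor, and
    each batch would be published to the worker queue in one call to
    amortize per-message overhead. A production version would also flush
    on a timer so a slow stream doesn't hold a partial batch forever.
    """
    batch: List[T] = []
    for item in stream:
        batch.append(item)
        if len(batch) >= max_size:
            yield batch
            batch = []
    if batch:
        yield batch

# The real loop would look roughly like this (needs a live cluster):
#
# with collection.watch(full_document="updateLookup") as stream:
#     for batch in batched(stream, max_size=100):
#         publish(batch)                          # hypothetical queue producer
#         save_resume_token(stream.resume_token)  # so a restart can resume

# Stand-in demonstration with a plain list instead of a live change stream:
changes = [{"_id": i} for i in range(7)]
print([len(b) for b in batched(changes, max_size=3)])  # → [3, 3, 1]
```

Persisting the resume token after each published batch is what would let a restarted reader pick up where it left off rather than replay or lose changes.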
So my questions are: how realistic is this worry, and is there any way to build the reader so that it can scale horizontally, even if it only streams the changes into a worker queue? What other considerations should I take into account?