5

Stages of a MongoDB aggregation pipeline are always executed sequentially. Can the documents that the pipeline processes be changed between the stages? E.g. if stage1 matches some docs from collection1 and stage2 matches some docs from collection2, can some documents from collection2 be written to during or just after stage1 (i.e. before stage2)? If so, can such behavior be prevented?

Why this is important: Say stage2 is a $lookup stage. $lookup is the NoSQL equivalent of an SQL join. In a typical SQL database, a join query is isolated from writes, meaning that while the join is being resolved, the data affected by the join cannot change. I would like to know if I have the same guarantee in MongoDB. Please note that I am coming from the NoSQL world (just not MongoDB) and I understand the paradigm well. No need to suggest e.g. duplicating the data; if there were such a solution, I would not be asking on SO.


Based on my research, a MongoDB read query acquires a shared (read) lock that prevents writes on the same collection until it is resolved. However, the MongoDB documentation does not say anything about aggregation pipeline locks. Does the aggregation pipeline hold read (shared) locks on all the collections it reads, or just on the collection used by the current pipeline stage?

More context: I need to run a "query" with multiple "joins" across several collections. The query is generated dynamically, so I do not know upfront which collections will be "joined". The aggregation pipeline is the supposed way to do that. However, to get consistent "query" data, I need to ensure that no writes are interleaved between the stages of the pipeline.

E.g. a delete between the $match and $lookup stages could remove one of the joined ("lookuped") documents, making the entire result incorrect/inconsistent. Can this happen? How can it be prevented? A sketch of the kind of pipeline I mean is below.

Rasto
  • 17,204
  • 47
  • 154
  • 245
  • Your question is focused on the wrong details relative to what seems to be your overall goal. Based on your last paragraph and especially the question at the very end, you might want to take a look at https://www.mongodb.com/docs/manual/reference/read-concern-snapshot/#mongodb-readconcern-readconcern.-snapshot-. It may also be worth reviewing your overall architecture and goals as such a strict level of isolation guarantee is atypical and may lead to operational challenges later on – user20042973 Jan 18 '23 at 00:51
  • @user20042973 Thank you for the answer and the link. I would beg to differ when it comes to "focused on the wrong details", because if Mongo's aggregation pipeline really acquires read (shared) locks on the collections used in the pipeline, then the example in the last paragraph is solved. And I am afraid we really require this level of isolation, because the product being created is a generic solution usable by different apps. Therefore, we can make few assumptions about the business logic executing the pipeline and its requirements... – Rasto Jan 18 '23 at 23:20
  • ...that being said, we do not have a MongoDB expert (or even an advanced user) on the team so it is entirely possible that our approach is wrong altogether. – Rasto Jan 18 '23 at 23:21
  • @ray could you elaborate on how transactions apply to the aggregation? If I understand the question, Rasto is asking about causal consistency, and the first comment already has a link to snapshot reads. It's not quite clear how you recommend using transactions in this particular case. – Alex Blex Jan 21 '23 at 00:31
  • @AlexBlex Thank you. I guess I misread the OP's scenario. Do you think the [readConcern](https://www.mongodb.com/docs/manual/reference/method/db.collection.aggregate/#definition) option in aggregate is what the OP is looking for? – ray Jan 21 '23 at 01:07
  • @ray If setting the readConcern option in aggregate to snapshot makes all the aggregate stages run with a common snapshot isolation, that is what I am looking for. Otherwise, can I just wrap the aggregate in a transaction that has snapshot readConcern? Would that work for all aggregate stages? Please note that I do not intend to do any writes in my aggregate (e.g. no $merge stage). – Rasto Jan 21 '23 at 03:17

3 Answers

3

@user20042973 already provided a link to https://www.mongodb.com/docs/manual/reference/read-concern-snapshot/#mongodb-readconcern-readconcern.-snapshot- in the very first comment, but considering the follow-up comments and questions from the OP regarding transactions, it seems a full answer is required for clarity.

So first of all, transactions are all about writes, not reads. I can't stress this enough, so please read it again: transactions, or "multi-document transactions" as MongoDB introduced them, are there to ensure that multiple updates are committed as a single atomic operation. No changes made within a transaction are visible outside of the transaction until it is committed, and all of the changes become visible at once when the transaction is committed. The docs: https://www.mongodb.com/docs/manual/core/transactions/#transactions-and-atomicity

The OP is concerned that concurrent writes to the database can affect the results of their aggregation operation, especially for $lookup operations that query other collections for each matching document from the main collection.

It's a very reasonable consideration, as MongoDB has always been eventually consistent and did not provide guarantees that such lookups would return the same results if the linked collection changed during aggregation. Generally speaking, it doesn't even guarantee that a unique key is unique within a cursor that uses its index: if a document is deleted and a new one with the same unique key is inserted while the cursor is iterating, there is a non-zero chance of retrieving both.

The instrument to work around this limitation is called "read concern", not "transaction". There are a number of read concerns available to balance speed against reliability/consistency: https://www.mongodb.com/docs/v6.0/reference/read-concern/ The OP is after the most expensive one, "snapshot". As https://www.mongodb.com/docs/v6.0/reference/read-concern-snapshot/ puts it:

A snapshot is a complete copy of the data in a mongod instance at a specific point in time.

"mongod" in this context means "the whole thing": all databases, collections within these databases, and documents within these collections.

All operations within a query with "snapshot" read concern are executed against the same version of the data as it was when the node accepted the query.

Transactions use this snapshot read isolation under the hood and can be used to guarantee consistent results for $lookup queries even if there are no writes within the transaction. I'd recommend using the read concern explicitly instead: less overhead, and more importantly, it clearly shows the intent to the devs who are going to maintain your app.

Now, regarding this part of the question:

Based on my research, a MongoDB read query acquires a shared (read) lock that prevents writes on the same collection until it is resolved.

It would be nice to have a source for this claim. As of today (v5.0+), aggregation is lock-free, i.e. it is not blocked even if another operation holds an exclusive X lock on the collection: https://www.mongodb.com/docs/manual/faq/concurrency/#what-are-lock-free-read-operations-

When it cannot use a lock-free read, it takes an intent shared (IS) lock on the collection. This lock blocks only collection-level write locks, like the ones taken by these administrative commands: https://www.mongodb.com/docs/manual/faq/concurrency/#which-administrative-commands-lock-a-collection-

An IS lock on a collection still allows X locks on individual documents within it: an insert, update, or delete of a document requires only an intent exclusive (IX) lock on the collection, and an exclusive X lock on the single document being affected by the write operation. A sketch of what this compatibility means in practice follows.


The final note: if such read isolation is critical to the business and you must guarantee strict consistency, I'd advise considering SQL databases. They might be more performant than snapshot queries. There are many more factors to consider, so I'll leave that to you. The point is that Mongo shines where eventual consistency is acceptable. It does pretty well with causal consistency within a server session, which gives enough guarantees for a much wider range of use cases. I encourage you to test how well it does with snapshot queries, especially if you are running multiple lookups, which can on their own be slow enough on larger datasets and might not even work without allowing disk use.

Alex Blex
  • 34,704
  • 7
  • 48
  • 75
  • Alex, thank you for the elaborate answer. Since the question was asked, I have of course learned a lot about read concerns and snapshots thanks to the comments and the previous answer. It certainly gives the guarantee we need, but I am also concerned about throughput. We do have a very particular use case, and MongoDB was initially selected for ease of use, especially for watching collection changes (the watch command). We did consider Postgres, but writing triggers just to observe data changes in tables seemed unnecessarily tedious. – Rasto Jan 26 '23 at 15:33
  • 1
    Fair enough. MongoDB shines when you need performance, scalability and durability. It's anything but easy when you need consistency. Also, the schemaless nature of the DB can be a blessing or can bring a lot of headache, especially for heterogeneous systems. Talking about change streams: test carefully how they handle re-election, as they ignore the "majority" read concern starting from v4.2. Plan the watchers' capacity - you cannot load-balance them. Complex/slow watchers may be an issue for massive updates. – Alex Blex Jan 27 '23 at 00:25
  • Can you provide a link for the source of the claim that change streams ignore majority read concern? Is there a way around it? – Rasto Jan 28 '23 at 16:47
  • Also, now that this answer is highly promoted by the highest possible bounty award, you might want to edit some misleading claims that I initially ignored. Specifically that "Transactions has nothing to do with consistency for read operations". That is not true, as you can set "snapshot" read concern on a "read-only transaction", which will pretty much solve my problem. It is possible that is what we will end up doing. – Rasto Jan 28 '23 at 16:51
  • yeah, I see how this phrase might be confusing. Poor word choice. Reworded for clarity. https://www.mongodb.com/docs/manual/changeStreams/#availability, the "Read Concern "majority" Enablement." part. – Alex Blex Jan 28 '23 at 17:41
  • The way I understand the linked source, they only say that starting from v4.2 watch can be used even when majority read concern is not enabled. "Starting in MongoDB 4.2, ..., read concern majority support can be either enabled (default) or disabled to use change streams." There is nothing saying that the majority read concern "is ignored" starting from 4.2, as you claim. The way I read it, the majority read concern is still possible but it is now optional. Do you perhaps have another source to support the statement, or is it based on experience? – Rasto Jan 29 '23 at 18:39
  • Actually, I found [this](https://www.mongodb.com/basics/change-streams) MongoDB documentation, and under the *Durable* feature of change streams it says: "Change streams only include majority-committed changes. This is so every change seen by listening applications is durable in failure scenarios, such as electing a new primary." However, this might be outdated documentation, as it links to the MongoDB 3.6 release. Is it outdated? – Rasto Jan 29 '23 at 21:15
  • @Rasto, yep, the behaviour was much clearer when it required an explicit majority read concern. Now it works regardless of the setting, hence "is ignoring". I couldn't find any docs on how it handles unconfirmed writes on re-election, which leaves only a test-driven approach. You know your setup and how often switchover happens, so you can estimate how much testing you can afford there. – Alex Blex Jan 30 '23 at 16:04
  • 1
    I can also ask at MongoDb forums. – Rasto Jan 30 '23 at 19:05
2

Q: Can MongoDB documents processed by an aggregation pipeline be affected by external writes during pipeline execution?

A: It depends on how the transactions are isolated from each other.

Snapshot isolation refers to transactions seeing a consistent view of data: transactions can read data from a “snapshot” of data committed at the time the transaction starts. Any conflicting updates will cause the transaction to abort.

MongoDB transactions support a transaction-level read concern and transaction-level write concern. Clients can set an appropriate level of read & write concern, with the most rigorous being snapshot read concern combined with majority write concern.

To achieve it, set readConcern=snapshot and writeConcern=majority on the connection string/session/transaction level (but not on the database/collection/operation level, as inside a transaction the database/collection/operation concern settings are ignored).

Q: Do transactions apply to all aggregation pipeline stages as well?

A: Not all operations are allowed in a transaction.

For example, according to the MongoDB docs, db.collection.aggregate() is allowed in a transaction, but some stages (e.g. $merge) are excluded. For the full list of operations supported inside a transaction, refer to the MongoDB docs. An example of an excluded stage follows.

egasimov
  • 41
  • 5
  • This sounds like what I need. Do transactions apply to aggregation pipelines as well? That is, can I start a transaction, set its readConcern to snapshot (or leave it as is since it is the default) and will that apply snapshot isolation to all my aggregation query stages? – Rasto Jan 21 '23 at 03:12
  • Could you perhaps provide code sample? – Rasto Jan 21 '23 at 03:19
  • **Q: Do transactions apply to aggregation pipelines as well?** *A: Some of the stages are not allowed in a transaction.* But it is worth mentioning that according to the MongoDB [docs](https://www.mongodb.com/docs/manual/core/transactions-operations/#crud-operations) ```db.collection.aggregate()``` is allowed in a transaction, but some stages (e.g. the ```$merge``` stage) are excluded. See more [here](https://www.mongodb.com/docs/manual/core/transactions-operations/#transactions-operations-crud). – egasimov Jan 21 '23 at 08:20
  • There are lots of code snippets in the mongo [docs](https://www.mongodb.com/docs/manual/core/transactions/#read-concern-write-concern-read-preference). You may find appropriate code snippets in your favorite programming language. – egasimov Jan 21 '23 at 08:22
-1

Yes, MongoDB documents processed by an aggregation pipeline can be affected by external writes during pipeline execution. This is because the aggregation pipeline operates on the data as it encounters it, without isolating itself from changes made to the data after the pipeline has started executing.

For example, if a document is being processed by the pipeline and an external write operation modifies or deletes that same document, the pipeline may not reflect those changes consistently in its results. In some cases, this can result in incorrect or incomplete data being returned by the pipeline.

To avoid this situation, you can use MongoDB's "snapshot" read concern, which guarantees that the documents returned by the pipeline are a snapshot of the data as it existed at the start of the pipeline execution, regardless of any external writes that occur during the execution. However, this option can affect the performance of the pipeline.

Alternatively, it is possible to use a transaction in MongoDB 4.0 and later versions, which allows you to get atomicity and consistency of the write operations on the documents during the pipeline execution.

ray
  • 11,310
  • 7
  • 18
  • 42