
This is a continuation of this question.

I'm using the following code to find all documents from collection C_a whose text contains the word StackOverflow and store them in another collection called C_b:

import pymongo
from pymongo import MongoClient
client = MongoClient('127.0.0.1')  # mongodb running locally
dbRead = client['C_a']             # database that holds the source collection
# build the aggregation pipeline
pipeline = [{"$match": {"$text": {"$search": "StackOverflow"}}}, {"$out": "C_b"}]  # all attributes and operators need to be quoted in pymongo
dbRead.C_a.aggregate(pipeline)  # run the aggregation
print(dbRead.C_b.count())  # verify the document count of the new collection

This works great. However, if I run the same snippet for multiple keywords, the results get overwritten. For example, I want the collection C_b to contain all documents that contain any of the keywords StackOverflow, StackExchange, and Programming. To do so I simply iterate the snippet above with each keyword, but unfortunately each iteration overwrites the previous one.
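
Roughly, the iteration looks like this (the keyword list and loop structure below are just an illustration of what I described):

keywords = ["StackOverflow", "StackExchange", "Programming"]

for kw in keywords:
    pipeline = [
        {"$match": {"$text": {"$search": kw}}},
        {"$out": "C_b"},  # $out replaces C_b on every iteration
    ]
    dbRead.C_a.aggregate(pipeline)

# C_b now only holds the matches for the last keyword, "Programming"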

Question: How do I update the output collection instead of overwriting it?

Plus: Is there a clever way to avoid duplicates, or do I have to check for duplicates afterwards?

Aventinus
    [$out will overwrite the collection if it exists](https://docs.mongodb.com/manual/reference/operator/aggregation/out/). Why do you need to create new collections? Why can't the requirement be satisfied by querying the original collection instead? – kevinadi May 29 '18 at 23:40
  • @KevinAdistambha The above is a toy example. In truth, I have a very large collection of documents from which I want to extract all documents containing a keyword from a list of keywords (more than 200) and study them in various axes. To do so I want to create a collection with these specific documents. Is there no way of doing such a thing? – Aventinus May 30 '18 at 10:55
  • The nice "actual MongoDB employee" pointed you directly to the documentation that tells you that your "ask" is not possible. The only options are A. New collection using `$out`. B. Iterate results on a returned cursor and write updates back. Of course B means transferring results and updates back "over the wire", which seems to be exactly what you are trying to avoid. You should have paid attention to the very clear lesson. – Neil Lunn Jun 01 '18 at 12:24

1 Answer


If you look at the documentation, $out doesn't support updating an existing collection:

https://docs.mongodb.com/manual/reference/operator/aggregation/out/#pipe._S_out

So you need to do a two-stage operation:

pipeline = [{"$match": {"$text": {"$search": "StackOverflow"}}}, {"$out": "temp"}]  # all attributes and operators need to be quoted in pymongo
dbRead.C_a.aggregate(pipeline)

and then use the approach discussed in

https://stackoverflow.com/a/37433640/2830850

dbRead.C_b.insert(
   dbRead.temp.aggregate([]).toArray()
)

And before starting the run, you will need to drop the C_b collection.
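
Putting the two stages together, a minimal end-to-end sketch in pymongo might look like the following (the keyword list is illustrative; list comprehension over the cursor is used because pymongo cursors have no toArray(), and a ReplaceOne upsert keyed on _id keeps a document that matches several keywords from being inserted twice):

from pymongo import MongoClient, ReplaceOne

client = MongoClient('127.0.0.1')
dbRead = client['C_a']

keywords = ["StackOverflow", "StackExchange", "Programming"]  # illustrative keyword list

dbRead.C_b.drop()  # start from an empty target collection

for kw in keywords:
    # stage 1: write this keyword's matches to a temp collection
    dbRead.C_a.aggregate([
        {"$match": {"$text": {"$search": kw}}},
        {"$out": "temp"},
    ])
    # stage 2: upsert the temp documents into C_b keyed on _id, so a
    # document matching more than one keyword is replaced, not duplicated
    ops = [ReplaceOne({"_id": doc["_id"]}, doc, upsert=True)
           for doc in dbRead.temp.aggregate([])]
    if ops:
        dbRead.C_b.bulk_write(ops)

print(dbRead.C_b.count())

The upsert keyed on _id also addresses the "Plus" question about duplicates: documents carry their original _id over from C_a, so re-matching the same document replaces it rather than adding a second copy.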

Tarun Lalwani
  • So the whole point of `$out` is to avoid the 16MB BSON limit. You propose to then read that whole collection into an `insert()` which also has that same 16MB limit. That's not going to work in any practical situation. Also that's not an "update" anyway. – Neil Lunn Jun 01 '18 at 12:31
  • The only other way would be to somehow update your aggregation to handle multiple values instead of doing it one step at a time – Tarun Lalwani Jun 01 '18 at 12:33
  • Point is this is wrong. Hence the comment to let the poor person who did not understand the very clear documentation that this is indeed an incorrect answer. – Neil Lunn Jun 01 '18 at 12:33
  • @TarunLalwani `dbRead.C_b.insert(dbRead.temp.aggregate([]).toArray())` raises an `AttributeError: 'CommandCursor' object has no attribute 'toArray'` error. – Aventinus Jul 02 '18 at 11:37
  • try `dbRead.C_b.insert(list(dbRead.temp.aggregate([])))` – Tarun Lalwani Jul 02 '18 at 11:39