
This is a continuation of this question.

I'm using the following code to find all documents from collection C_a whose text contains the word StackOverflow and store them in another collection called C_b:

import pymongo
from pymongo import MongoClient
client = MongoClient('127.0.0.1')  # mongodb running locally
dbRead = client['C_a']             # database that holds the source collection
# build the aggregation pipeline
pipeline = [{"$match": {"$text": {"$search": "StackOverflow"}}}, {"$out": "C_b"}]  # all attributes and operators need to be quoted in pymongo
dbRead.C_a.aggregate(pipeline)  # run the aggregation
print(dbRead.C_b.count())  # verify the document count of the new collection

This works great. However, if I run the same snippet for multiple keywords, the results get overwritten. For example, I want the collection C_b to contain all documents that contain any of the keywords StackOverflow, StackExchange, and Programming. To do so I simply iterate the snippet above with each keyword, but unfortunately each iteration overwrites the previous one.
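
Roughly, the iteration looks like this (the keyword list and loop structure below are just an illustration of what I described):

keywords = ["StackOverflow", "StackExchange", "Programming"]

for kw in keywords:
    pipeline = [
        {"$match": {"$text": {"$search": kw}}},
        {"$out": "C_b"},  # $out replaces C_b on every iteration
    ]
    dbRead.C_a.aggregate(pipeline)

# C_b now only holds the matches for the last keyword, "Programming"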

Question: How do I update the output collection instead of overwriting it?

Plus: Is there a clever way to avoid duplicates, or do I have to check for duplicates afterwards?

Aventinus
    [$out will overwrite the collection if it exists](https://docs.mongodb.com/manual/reference/operator/aggregation/out/). Why do you need to create new collections? Why can't the requirement be satisfied by querying the original collection instead? – kevinadi May 29 '18 at 23:40
  • @KevinAdistambha The above is a toy example. In truth, I have a very large collection of documents from which I want to extract all documents containing a keyword from a list of keywords (more than 200) and study them in various axes. To do so I want to create a collection with these specific documents. Is there no way of doing such a thing? – Aventinus May 30 '18 at 10:55
  • The nice "actual MongoDB employee" pointed you directly to the documentation that tells you that your "ask" is not possible. The only options are A. New collection using `$out`. B. Iterate results on a returned cursor and write updates back. Of course B means transferring results and updates back "over the wire", which seems to be exactly what you are trying to avoid. You should have paid attention to the very clear lesson. – Neil Lunn Jun 01 '18 at 12:24

1 Answer


If you look at the documentation, $out doesn't support updating an existing collection:

https://docs.mongodb.com/manual/reference/operator/aggregation/out/#pipe._S_out

So you need to do a two-stage operation:

pipeline = [{"$match": {"$text": {"$search": "StackOverflow"}}}, {"$out": "temp"}]  # all attributes and operators need to be quoted in pymongo
dbRead.C_a.aggregate(pipeline)

and then use the approach discussed in

https://stackoverflow.com/a/37433640/2830850

dbRead.C_b.insert(
   dbRead.temp.aggregate([]).toArray()
)

And before starting the run, you will need to drop the C_b collection.
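
Putting the two stages together, a minimal end-to-end sketch in pymongo might look like the following (the keyword list is illustrative; list comprehension over the cursor is used because pymongo cursors have no toArray(), and a ReplaceOne upsert keyed on _id keeps a document that matches several keywords from being inserted twice):

from pymongo import MongoClient, ReplaceOne

client = MongoClient('127.0.0.1')
dbRead = client['C_a']

keywords = ["StackOverflow", "StackExchange", "Programming"]  # illustrative keyword list

dbRead.C_b.drop()  # start from an empty target collection

for kw in keywords:
    # stage 1: write this keyword's matches to a temp collection
    dbRead.C_a.aggregate([
        {"$match": {"$text": {"$search": kw}}},
        {"$out": "temp"},
    ])
    # stage 2: upsert the temp documents into C_b keyed on _id, so a
    # document matching more than one keyword is replaced, not duplicated
    ops = [ReplaceOne({"_id": doc["_id"]}, doc, upsert=True)
           for doc in dbRead.temp.aggregate([])]
    if ops:
        dbRead.C_b.bulk_write(ops)

print(dbRead.C_b.count())

The upsert keyed on _id also addresses the "Plus" question about duplicates: documents carry their original _id over from C_a, so re-matching the same document replaces it rather than adding a second copy.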

Tarun Lalwani
  • So the whole point of `$out` is to avoid the 16MB BSON limit. You propose to then read that whole collection into an `insert()` which also has that same 16MB limit. That's not going to work in any practical situation. Also that's not an "update" anyway. – Neil Lunn Jun 01 '18 at 12:31
  • The only other way would be to somehow update your aggregation to handle multiple values instead of doing it one step at a time – Tarun Lalwani Jun 01 '18 at 12:33
  • Point is this is wrong. Hence the comment to let the poor person who did not understand the very clear documentation that this is indeed an incorrect answer. – Neil Lunn Jun 01 '18 at 12:33
  • @TarunLalwani `dbRead.C_b.insert(dbRead.temp.aggregate([]).toArray())` raises an `AttributeError: 'CommandCursor' object has no attribute 'toArray'` error. – Aventinus Jul 02 '18 at 11:37
  • try `dbRead.C_b.insert(list(dbRead.temp.aggregate([])))` – Tarun Lalwani Jul 02 '18 at 11:39