1

Hello and happy new year to everybody.

This question is because I have a problem and I don't know what to look for as a solution. I think that aggregation is no good here.

I am a novice level programmer starting with python 3.6 and pymongo 3.4 to handle Mongodb.

I have 50M documents inside a "lostItems" collection inside a "warehouse" database with this structure:

{
    "_id": "5a4d018526f9c63894d11254",
    "time": "2000-01-03T00:02:00.000Z",
    "item": "sunglasses"
    "level": 45,
    "section": 15,
    "box": 29, 
    "home": "CA"
}

"_id" field is the normal autogenerated key. "time" is not a key and is not enforced to be unique, but it happens to be so, for now.

Let's see if I can explain clearly the query I want:

  • It will order documents by "time" ("time", pymongo.DESCENDING)

  • It will look for, and note down somehow, every "home": "CA" for use with the second query.

  • It will return, for each '"home": "CA"' hit, not just that document but also the 5 previous documents in the specified order (whatever the "home" value might be) and the later 10 documents.

In other words, the query will return a total of 16 documents, that are consecutive when ordered by "time", for each '"home": "CA"' found.

My approach (as a novice) would be to query for the hits and write them down in a temporal db and then make another query to get and add the previous and later documents required. This feels a little bit complicated and mostly slow query.

import pymongo
col = pymongo.MongoClient('localhost',  27017).warehouse.lostItems
query = {"home": "CA"}
order = [ ("time",  pymongo.DESCENDING) ]

try:
    for doc in col.find(query).sort(order):
        **"Now I don't know what to do"**
except Exception as e:
    print ("Error :",  type(e), e)

Could you please give me some advice and/or tell me what is this kind of query called so that I can look for more similar cases to learn about?

Thank you very much in advance.

Deus
  • 11
  • 2

1 Answers1

0

You want a window function but this doesn't exist in MongoDB so you'll have to do it in Python using the itertools package:

from itertools import islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result
Hatshepsut
  • 5,962
  • 8
  • 44
  • 80
  • Thank you very much Hatshepsut. I'm kind of a newbie but as far as I, in my limited knowledge, understand, to do that I would have to bring the whole database to work with the elements so that I can get the results. Thats about 5 GB everytime I make a query. Am I getting it wrong? – Deus Jan 05 '18 at 10:06
  • @Deus are you trying to get information about one item and the 16 surrounding items or every item and the 16 surrounding items for each one? – Hatshepsut Jan 05 '18 at 10:14
  • Im trying every item and the surrounding 15 for each one. This is, if there are 1000 '"home": "CA"' in the collection, I would get 16000 documents back to later work with. – Deus Jan 05 '18 at 10:21
  • @Ha sorry, I forgot to say that the later work will use the data packed 16 by 16, not all the hypothetical result of 16000 mixed together. I could solve this grouping by myself anyway. – Deus Jan 05 '18 at 10:30
  • @Deus You'll get 1000 documents from the database with the `find(query)`. You can turn that into 1000 lists of 16 documents using a small modification of `window(documents)`, or by passing the result of `window` as-is to a chunker function. – Hatshepsut Jan 05 '18 at 18:15
  • Let's see if I am getting it. Through find(query) I'll get the cursor of the hypothetical 1000 documents. I'll have to iterate with a for-loop to get the 1000 "_id" and store them in a list. I think that now I would need another for-loop to look for each of these "_id" and then make the query for the 16 documents. Ok, first document could be "_id": "5a4d018526f9c63894d11254", so I look for it, and now, how do I get the other 15? How do I tell Mongo that I want the 5 previous and 10 later documents in the "time" descending order? What would that query look like? – Deus Jan 05 '18 at 21:56
  • @Deus If you want the closest items of any kind, you can do that by getting the timestamps and using `$sort` and `$limit`. If you want the closest items matching your original query, you can do that in python as above. – Hatshepsut Jan 05 '18 at 22:21
  • Yes, as explained above, the hits have to match the query of "home":"CA" but not the surrounding 15 items of each hit that have to be the closest to each hit according to timestamp, without any further filters. So, following you, I understand that I will have to do, for each of the "hits" of the first query, two limited queries one in each sort direction. If the first query hypothetically has 1000 results, this would mean 2000 more queries to reach the goal. I have never programmed anything querying so much, but if it's the way to go ... – Deus Jan 06 '18 at 12:27