
I have an $or query which I'm currently using for a semi-large update. Essentially my collection is split into two data sets:

1 main repository and 1 subset of the main repository, which exists simply to allow quicker searching over a smaller amount of data.

I'm finding, however, that the query I use to pull things into the subset is timing out, and looking at the explain() it appears two separate queries are actually happening.

PRIMARY> var date = new Date(2012, 5, 1);  // JS months are zero-based: 1 June 2012
PRIMARY> db.col.find(
  {"$or":[
      {"date":{"$gt":date}},
      {"keywords":{"$in":["Help","Support"]}}
   ]}).explain();

This produces:

{
    "clauses" : [
    {
        "cursor" : "BtreeCursor ldate_-1",
        "nscanned" : 1493872,
        "nscannedObjects" : 1493872,
        "n" : 1493872,
        "millis" : 1035194,
        "nYields" : 3396,
        "nChunkSkips" : 0,
        "isMultiKey" : false,
        "indexOnly" : false,
        "indexBounds" : {
            "ldate" : [
                [
                    ISODate("292278995-01--2147483647T07:12:56.808Z"),
                    ISODate("2012-06-01T07:00:00Z")
                ]
            ]
        }
    },
    {
        "cursor" : "BtreeCursor keywords_1 multi",
        "nscanned" : 88526,
        "nscannedObjects" : 88526,
        "n" : 2515,
        "millis" : 1071902,
        "nYields" : 56,
        "nChunkSkips" : 0,
        "isMultiKey" : false,
        "indexOnly" : false,
        "indexBounds" : {
            "keywords" : [
                [
                    "Help",
                    "Help"
                ],
                [
                    "Support",
                    "Support"
                ]
            ]
        }
    }
    ],
    "nscanned" : 1582398,
    "nscannedObjects" : 1582398,
    "n" : 1496387,
    "millis" : 1071902
}

Is there something I can index better to make this faster? It seems just way too slow...

Thanks ahead of time!

Petrogad
  • What's your index? Is it just a date field? – RameshVel Nov 02 '12 at 13:17
  • I've tried creating separate indexes on both keywords and date, and tried creating them as a compound index across the two. – Petrogad Nov 02 '12 at 13:20
  • Hmm, you are fetching 1.4m records there... that's quite a few. You could try extending the timeout on the mongo cursor for such a large query. Also, what is your setup here? 17 minutes is a very long time, I admit – Sammaye Nov 02 '12 at 14:40
  • @Sammaye I'm trying to take the records and fetch all matching a criteria, then move them into another collection with only a subset of the actual fields, creating a searchable dataset per se. – Petrogad Nov 02 '12 at 14:45
  • Yeah, you're trying to aggregate into another collection, am I right? Are you doing it client-side or is this MR-based? If client-based, what language (it might make some difference depending on possible driver bugs)? – Sammaye Nov 02 '12 at 15:04
  • @Sammaye Doing it client-side (PHP) through the CLI interface in a batched cron job. – Petrogad Nov 02 '12 at 15:12
  • Hmmm, I know of no problems that should affect you here; you can always try the mongodb-user Google group for this too. My own personal knowledge of $or index usage is limited with this question, and all the signs say it shouldn't take 17 minutes to pick out 2515 rows from your second clause. The only thing I can think is that $in does not use the index properly, which is why you are scanning 88k records. – Sammaye Nov 02 '12 at 15:22
  • @Sammaye I'll post there and see if maybe there is some other bug going on. I did notice that when changing my index to another date field I had, it was near instant. – Petrogad Nov 02 '12 at 15:29
  • Interesting, I wonder if the highest ISODate value, `ISODate("292278995-01--2147483647T07:12:56.808Z")`, is causing some kind of leak in the index. – Sammaye Nov 03 '12 at 11:46
  • How big is your data set? And how much RAM do you have on this server? It would be helpful to post a gist/pastebin snippet of `mongostat` output collected while the query is running. Also, what version of MongoDB are you using? – Stennie Nov 03 '12 at 13:13

2 Answers


An $or query will evaluate each clause separately and combine the results to remove duplicates, so if you want to optimize the query you should first try to explain() each clause individually.
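For example, using the same query shape as above, each clause can be explained on its own in the shell (requires a live mongod; shown for illustration):

```
PRIMARY> db.col.find({"date": {"$gt": date}}).explain()
PRIMARY> db.col.find({"keywords": {"$in": ["Help", "Support"]}}).explain()
```

Comparing the two plans will show which clause is responsible for the bulk of the scanning.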

It looks like part of the problem is that you are retrieving a large number of documents while actively writing to that collection, as evidenced by the high nYields (3396). It would be worth reviewing mongostat output while the query is running to consider other factors such as page faulting, lock %, and read/write queues.

If you want to make this query faster for a large number of documents and very active collection updates, two best practice approaches to consider are:

1) Pre-aggregation

Essentially this is updating aggregate stats as documents are inserted/updated so you can make fast real-time queries. The MongoDB manual describes this use case in more detail: Pre-Aggregated Reports.
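As a rough sketch of the idea (the `daily_stats` collection and its field names are hypothetical, not from your schema), every write to the main collection is paired with an upsert that keeps a small summary document current:

```
// On every write to the main collection, also bump counters in a
// small summary collection; upsert creates the doc if it is missing.
db.daily_stats.update(
    { _id: "2012-06-01" },                      // one summary doc per day
    { $inc: { total: 1, "keywords.Help": 1 } }, // counters updated in place
    true                                        // upsert
)
// Real-time queries then read the tiny summary docs instead of
// scanning ~1.4m documents in the main collection.
```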

2) Incremental Map/Reduce

An incremental Map/Reduce approach can be used to calculate aggregate stats in successive batches (for example, from an hourly or daily cron job). With this approach you perform a Map/Reduce using the reduce output option to save results to a new collection, and include a query filter that only selects documents that have been created/updated since the last time this Map/Reduce job was run.
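A minimal sketch of that approach (field and collection names are assumptions based on the query above): the map/reduce pair counts documents per keyword, and the query filter plus the `{ reduce: ... }` output option folds each new batch into the existing results.

```javascript
// Map: emit a count of 1 for every keyword on the document.
var mapFn = function () {
    this.keywords.forEach(function (kw) { emit(kw, 1); });
};

// Reduce: sum the partial counts for a keyword.
var reduceFn = function (key, values) {
    return values.reduce(function (a, b) { return a + b; }, 0);
};

// Run from a cron job; lastRun would be persisted between runs so each
// invocation only touches documents created/updated since the last one.
// (Requires a live mongod, so the call is shown commented out.)
// db.col.mapReduce(mapFn, reduceFn, {
//     out: { reduce: "keyword_counts" },   // merge into existing output
//     query: { date: { $gt: lastRun } }    // incremental batch only
// });
```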

Stennie

I think you should create a compound index on both date and keywords. Refer to the post below for more specifics based on your use case:

how to structure a compound index in mongodb
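For reference, such an index could be created like this (field order is a judgment call, and the comments below question whether a compound index helps an $or at all):

```
db.col.ensureIndex({ date: -1, keywords: 1 })
```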

Aravind Yarram
  • Just checked; however, it seems OR conditions are still hitting two indexes. I figured this would then slow it down, as it would have to look at the exact same index twice, only using half of the information. – Petrogad Nov 02 '12 at 13:21
  • An $or can use two index plans, and will; it is a special case of Mongo querying. It derives fundamentally from how an $or is executed in Mongo – Sammaye Nov 02 '12 at 14:44
  • A compound index will not be helpful; as Sammaye mentions [`$or` uses multiple indexes](http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24or) (one per clause) and combines the results to remove duplicates. – Stennie Nov 03 '12 at 13:08