4

We have a mongo database with around 1M documents, and we want to poll this database using a processed field to find documents which we havent seen before. To do this we are setting a new field called _processed.

To query for documents which need to be processed, we query for documents which do not have this processed field:

db.stocktwits.find({ "_processed" : { "$exists" : false } })

However, this query takes around 30 seconds to complete each time, which is rather slow. There is an index (asc) which sits on the _processed field:

db.stocktwits.ensureIndex({ "_processed" : -1 },{ "name" : "idx_processed" });

Adding this index does not change query performance. There are a few other indexes sitting on the collection (namely the ID idx & a unique index of a couple of fields in each document).

The _processed field is a long, perhaps this should be changed to a bool to make things quicker?

We have tried using a $where query (i.e. $where : this._processed==null) to do the same thing as $exists : false and the performance is about the same (few secs slower which makes sense)...

Any ideas on what would be casusing the slow performance (or is it normal)? Does anyone have any suggestions on how to improve the query speed?

Cheers!

Abe Miessler
  • 82,532
  • 99
  • 305
  • 486
NightWolf
  • 7,694
  • 9
  • 74
  • 121
  • Is simply setting the _processed field to `false` when you create the documents not an option? – Ian Mercer Sep 21 '11 at 16:03
  • Thanks for the comment. That would be workable (but as a last resort) - I would like to avoid this as the dumping processes are prodized. The _processed field is a long in this instance so we would just set it to something like 0 or -1. This is a good point however, would changing the processed field type to a bool make things faster? The only reason processed is a long is stores a timestamp of the processed time (just to help with any debugging purposes), but in reality it could be set to a bool, if we really needed it we could have _processed (bool) and _porcessed_timestamp (long)..? – NightWolf Sep 21 '11 at 16:11

2 Answers2

5

Upgrading to 2.0 is going to do this for you:

From MongoDB.org:

Before v2.0, $exists is not able to use an index. Indexes on other fields are still used.

joostdevries
  • 930
  • 1
  • 6
  • 13
4

Its slow because checking for _processed -> not exists doesnt offer much selectivity. Its like having an index on "Gender" - and since there are only two possible options male or female then if you have 1M rows and an index on Gender it will have to scan 50% or 500K rows to find all males.

You need to make your index more selective.

Cody Caughlan
  • 32,456
  • 5
  • 63
  • 68
  • Thanks for your comment Cody. I see your point, in Mongo all null fields are indexed. So if I have a dataset that is mostly processed its not that bad a scan, it beats having to scan each document. Even if I made _processed = false, its really 6 one half dozen the other as _processed = false or _processed = null for all intensive purposes are equiv... – NightWolf Sep 21 '11 at 16:31
  • I don't think this is relevant. MongoDB returns results using a cursor. Provided the values are indexed there is no 'scan', the results can be streamed back as fast as you can grab them. The issue here is that the search wasn't using the index. – Ian Mercer Sep 22 '11 at 06:13