
We've recently passed 2 million records in one of our main collections, and we've started to suffer from major performance issues on that collection.

The documents in the collection have about 8 fields that users can filter by through the UI, and the results are supposed to be sorted by a timestamp field recording when the record was processed.

I've added several compound indexes on the filtered fields and the timestamp, e.g.:

db.events.ensureIndex({somefield: 1, timestamp:-1})
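
That is, indexes like this are meant to serve queries of the following shape (the field name is a stand-in, as above):

db.events.find({somefield: 'some value'}).sort({timestamp: -1}).limit(25)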

I've also added a couple of indexes covering several filters at once, hoping to achieve better performance. But some filters still take an awfully long time to run.

Using explain(), I've made sure that the queries do use the indexes I've created, but performance is still not good enough.

I was wondering if sharding is the way to go now, but we will soon start to have about 1 million new records per day in that collection, so I'm not sure it will scale well.

EDIT: an example query:

> db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': {$in: ['nickey@acme.com']}}).sort({timestamp: -1}).limit(25).explain()
{
        "cursor" : "BtreeCursor user.userName_1_timestamp_-1",
        "isMultiKey" : false,
        "n" : 0,
        "nscannedObjects" : 30060,
        "nscanned" : 30060,
        "nscannedObjectsAllPlans" : 120241,
        "nscannedAllPlans" : 120241,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 1,
        "nChunkSkips" : 0,
        "millis" : 26495,
        "indexBounds" : {
                "user.userName" : [
                        [
                                "nickey@acme.com",
                                "nickey@acme.com"
                        ]
                ],
                "timestamp" : [
                        [
                                {
                                        "$maxElement" : 1
                                },
                                {
                                        "$minElement" : 1
                                }
                        ]
                ]
        },
        "server" : "yarin:27017"
}

Please note that deviceType has only 2 values in my collection.

Yarin Miran
  • Are you using the `limit` argument? – Joe Oct 24 '13 at 07:33
  • Nice one! I will probably have a similar scenario in the future, and the answers might be helpful to me as well. How big is your database with that collection? How long did querying those 8 KV pairs take before you reached 2 million records, and how long is it taking now? (just curious) – anvarik Oct 24 '13 at 07:34
  • Joe, yes, of course I'm using limits; at the moment I'm limiting my results to 25 documents. I don't even want to talk about skips, since I'm going to replace them with range queries in the near future. – Yarin Miran Oct 24 '13 at 07:43
  • Enver, when the collection was about 1-2 million records I started to sense some performance issues (5-50 second query times). Then I added indexes and got reasonable query performance of under 1000 ms. Now queries take anywhere from 20 ms to 60 seconds, but it all depends on the value distribution of the filtered fields and how 'helpful' the indexes actually were. – Yarin Miran Oct 24 '13 at 07:52
  • Which queries are slow? Is a simple query with no filtering already slow? Or are only queries filtered by one field slow? Or by two fields? – Joe Oct 24 '13 at 08:03
  • Can you paste an explain of one of the slow queries? Also does your working set fit into RAM? Also is this when the collection is warm or cold? – Sammaye Oct 24 '13 at 08:39
  • Well, even single-field queries are a bit slow at first, but I guess Mongo caches some of the results and then the performance is better. Same goes for multiple fields. It's just that most of the time the filtering takes much longer... – Yarin Miran Oct 24 '13 at 08:42
  • Sammaye, what do you mean by a warm/cold collection? – Yarin Miran Oct 24 '13 at 08:43
  • A collection is considered cold when MongoDB hasn't run a query on it within the LRU window, normally overnight or so. A warm collection is one that should already be in memory – Sammaye Oct 24 '13 at 08:44

3 Answers


This is searching for a needle in a haystack. We'd need some explain() output for the queries that don't perform well. Unfortunately, even that would fix the problem only for that particular query, so here's a strategy on how to approach this:

  1. Ensure it's not because of insufficient RAM and excessive paging
  2. Enable the DB profiler (using db.setProfilingLevel(1, timeout), where timeout is the threshold in milliseconds; anything slower will be logged)
  3. Inspect the slow queries in db.system.profile and run them manually using explain() (see the sketch after this list)
  4. Try to identify the slow operations in the explain() output, such as scanAndOrder or a large nscanned
  5. Reason about the selectivity of the query and whether it's possible to improve the query using an index at all. If not, consider disallowing that filter setting for the end user, or show them a warning dialog that the operation might be slow.
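
A minimal mongo shell sketch of steps 2 and 3 (the 100 ms threshold and the audit collection name are assumptions; use whatever counts as 'slow' in your setup):

// Step 2: log every operation slower than 100 ms (assumed threshold)
db.setProfilingLevel(1, 100)

// Step 3: inspect the worst offenders the profiler has captured...
db.system.profile.find().sort({ millis: -1 }).limit(10).pretty()

// ...and re-run a suspect query manually with explain()
db.audit.find({ 'userAgent.deviceType': 'MOBILE' }).sort({ timestamp: -1 }).limit(25).explain()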

A key problem is that you're apparently allowing your users to combine filters at will. Without index intersection, that will blow up the number of required indexes dramatically.

Also, blindly throwing an index at every possible query is a very bad strategy. It's important to structure the queries and make sure the indexed fields have sufficient selectivity.

Let's say you have a query for all users with status "active" and some other criteria. Of the 5 million users, 3 million are active and 2 million aren't, so over 5 million entries there are only two distinct values. Such an index doesn't usually help. It's better to search for the other criteria first, then scan the results. Since 60% of users are active, returning 100 documents will require scanning only about 167 on average, which won't hurt performance too badly. But it's not that simple: if the primary criterion is the joined_at date of the user, and the likelihood of users discontinuing use grows with time, you might end up having to scan thousands of documents before finding a hundred matches.
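
To make that concrete, here's a hypothetical mongo shell sketch (the users collection and the country field are invented for illustration):

// An index on a field with only two values barely narrows anything down:
db.users.ensureIndex({ status: 1 })
db.users.find({ status: 'active', country: 'DE' })  // still walks the ~3M 'active' entries

// Leading with the more selective field lets the index do the real work,
// and status is then checked within a much smaller subset:
db.users.ensureIndex({ country: 1, status: 1 })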

So the optimization depends very much on the data (not only its structure, but also the data itself), its internal correlations and your query patterns.

Things get worse when the data is too big for RAM: having an index is still great, but scanning (or even simply returning) the results might require randomly fetching a lot of data from disk, which takes a lot of time.

The best way to control this is to limit the number of different query types, disallow queries on low selectivity information and try to prevent random access to old data.

If all else fails and you really need that much flexibility in filters, it might be worthwhile to consider a separate search DB that supports index intersections, fetch the MongoDB ids from there, and then get the results from Mongo using $in. But that approach is fraught with its own perils.
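
A hedged sketch of that two-phase pattern, with hypothetical ids standing in for whatever the search DB returned:

// Phase 1 (not shown): ask the search DB for the ids of matching documents.
// Phase 2: fetch those documents from MongoDB; sort and limit stay here.
var idsFromSearchDb = [ObjectId('5268f9c3e4b0f4b3a1d3c001'), ObjectId('5268f9c3e4b0f4b3a1d3c002')]  // hypothetical ids
db.audit.find({ _id: { $in: idsFromSearchDb } }).sort({ timestamp: -1 }).limit(25)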

-- EDIT --

The explain you posted is a beautiful example of the problem with scanning low-selectivity fields. Apparently, there are a lot of documents for "nickey@acme.com". Finding those documents and sorting them descending by timestamp is pretty fast, because it's supported by high-selectivity indexes. Unfortunately, since there are only two device types, Mongo had to scan 30060 documents looking for one that matches 'MOBILE' (and, with n = 0, it never found a single one).

I assume this is some kind of web tracking, and the user's usage pattern is what makes the query slow (if they switched between mobile and web on a daily basis, the query would be fast).

Making this particular query faster could be done with a compound index that contains the device type, e.g.

a) db.audit.ensureIndex({'user.userName': 1, 'userAgent.deviceType': 1, 'timestamp': -1})

or

b) db.audit.ensureIndex({'userAgent.deviceType': 1, 'user.userName': 1, 'timestamp': -1})

Unfortunately, that means that queries like find({'user.userName': 'foo'}).sort({'timestamp': -1}) can't use the same index anymore, so, as described, the number of indexes will grow very quickly.

I'm afraid there's no very good solution for this using mongodb at this time.

mnemosyn
  • Thanks for the reply! Another issue we have is that on our Mongo there are several client databases, each with its own copy of that huge collection. We're afraid that indexing all of these collections will hurt performance badly, since we'll need a huge amount of RAM to support simultaneous queries from different users. Do you have a suggestion for a good search DB for this purpose? – Yarin Miran Oct 24 '13 at 09:12
  • I guess that depends on the search features you need. For the basics, any DB that supports index intersections should do. If you need full-text search, faceted search or even slice & dice, things get tricky and there's a whole universe of tools, from Solr and Elasticsearch up to OLAP cubes. While you're at it, you could also vote for index intersection in the MongoDB Jira: https://jira.mongodb.org/browse/SERVER-3071 – mnemosyn Oct 24 '13 at 09:34
  • I think we'll go for Elasticsearch for this particular collection. What do you think about it? – Yarin Miran Oct 31 '13 at 07:09
  • Great answer. I would be interested to know what has changed in the past 4.5 years in this regard. – Daniel Hilgarth Mar 05 '18 at 20:20
  • I would be interested to know what has changed in the past 8 years in this regard. – Stunner Dec 03 '21 at 10:57

Mongo uses only one index per query. So if you want to filter on two fields, Mongo will use an index on one of the fields, but still has to scan the entire resulting subset.

This means that basically you'll need an index for every type of query in order to achieve the best performance.

Depending on your data, it might not be a bad idea to run one query per field and intersect the results in your app, as sketched below. That way you only need an index on each single field, but it may be too much data to process.
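
A rough sketch of that per-field idea in the mongo shell (field names borrowed from the question; whether this beats a single filtered query depends entirely on how large the intermediate results are):

// One indexed query per filter field, projecting only the ids:
var mobileIds = db.audit.find({ 'userAgent.deviceType': 'MOBILE' }, { _id: 1 }).toArray().map(function (d) { return d._id.str })
var userIds = db.audit.find({ 'user.userName': 'nickey@acme.com' }, { _id: 1 }).toArray().map(function (d) { return d._id.str })

// Intersect in the application -- with a low-selectivity field like
// deviceType, this is exactly where 'too much data to process' bites:
var both = mobileIds.filter(function (id) { return userIds.indexOf(id) !== -1 })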

Mark Meeus

If you are using $in, MongoDB never uses an index. Change your query by removing the $in; it should then use an index and give better performance than what you got earlier.

http://docs.mongodb.org/manual/core/query-optimization/

Gopal
  • FYI, $in does use an index; it's $nin that doesn't. The problem with $in, from what we experienced, is that Mongo performs the query once per value in the $in. Despite using an index for each of those queries, it's extremely slow. – Yarin Miran Oct 19 '14 at 07:11