
I'm currently trying to convince management that we should move some of our data away from MS SQL and into NoSQL (probably MongoDB). Specifically, what I want to move is our WebStats system. Currently we have roughly 150 million rows in a table, and this dataset is always growing (we store a year's worth of stats).

As a test I've run the following insert 150 million times:

db.test.insert({ SiteId:1, PageUrl:"/home/", Impressions:1, Date: new Date(), IsCrawler:false, LanguageId:2057, ClientIpAddress:"1.2.3.4", DateTime: new Date(), ReferalUrl: "http://www.google.com", UniqueUserGuid:1, BrowserName:"IE", BrowserVersion:11, BrowserAgent:"blah", IsAbcValid:true, hasChecked:true, connectionSpeed:1, Country:"UK", Region:"Midlands", City:"Coventry" })

I then executed this once:

db.test.insert({ SiteId:1, PageUrl:"/home/", Impressions:1, Date: new Date(), IsCrawler:false, LanguageId:2057, ClientIpAddress:"1.2.3.4", DateTime: new Date(), ReferalUrl: "http://www.google.com", UniqueUserGuid:1, BrowserName:"IE", BrowserVersion:11, BrowserAgent:"blah", IsAbcValid:true, hasChecked:true, connectionSpeed:1, Country:"US", Region:"New York", City:"New York" })

Followed by:

db.test.ensureIndex( { "PageUrl": 1, "Date": 1, "ClientIpAddress": 1 } )

After the indexing finished, I ran the following search:

db.test.find({Country:/S/})

It eventually found the US document that I added but it took longer than it would in MS SQL. Am I indexing this incorrectly? I'm basically just trying to knock up a demonstration of the possible performance gains, so if anyone could point me to an example that deals with very large data sets then I'll gladly use that instead.

Thanks,

Joe

3 Answers


You have to index by Country to search by it. If you want to use a regex like that, the index typically doesn't help unless the pattern is anchored to the beginning of the string.

db.test.find({Country:{$regex:"^US"}})
jwillis0720

You should create an index like db.test.ensureIndex({Country:1}) if you are querying on Country.

Parvin Gasimzade
  • Sorry, I'm an idiot. I was planning on querying PageUrl and then changed my mind. I'm not with it this morning. – JoeS Dec 11 '13 at 09:56
  • That's a good catch, but not going to have much impact on the query in the question :) – AD7six Dec 11 '13 at 14:08

Inefficient query

The index in the question is on the wrong field, but assuming that's corrected the query will still be pretty slow.

From the docs:

$regex can only use an index efficiently when the regular expression has an anchor for the beginning (i.e. ^) of a string and is a case-sensitive match.

This query:

db.test.find({Country:/S/})

Does not start with a fixed string and therefore cannot use an index efficiently. Using explain, the impact of this kind of regex query can be investigated, e.g.:

> db.test.find({Country:/S/}).explain();
{
"cursor" : "BtreeCursor Country_1 multi",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 150000000,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 2,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
    "Country" : [
        [
            "",
            {
                
            }
        ],
        [
            /S/,
            /S/
        ]
    ]
},
"server" : "host.local:27017"
}

Note the value of nscanned, which is the total number of documents or index entries scanned during the database operation. The index is used, but the whole index needs to be scanned irrespective of the number of matching documents.
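As a rough model of what that explain output shows (plain JavaScript, with a made-up key list standing in for the Country_1 index): an unanchored regex has to test every index entry, while an anchored prefix can seek to a key range and stop as soon as the prefix no longer matches.

```javascript
// A small sorted list standing in for the Country_1 index keys.
const indexKeys = ["France", "Spain", "Sweden", "UK", "US"];

// Unanchored regex: every key must be tested,
// so "nscanned" equals the size of the whole index.
function scanAll(keys, re) {
  let scanned = 0;
  const hits = [];
  for (const k of keys) {
    scanned++;
    if (re.test(k)) hits.push(k);
  }
  return { hits, scanned };
}

// Anchored prefix: seek to the first key >= prefix and stop at the
// first key that no longer starts with it, so "nscanned" stays close
// to the number of matches. (A real B-tree does the seek in O(log n).)
function scanPrefix(keys, prefix) {
  let scanned = 0;
  const hits = [];
  let i = keys.findIndex(k => k >= prefix);
  if (i === -1) i = keys.length;
  for (; i < keys.length && keys[i].startsWith(prefix); i++) {
    scanned++;
    hits.push(keys[i]);
  }
  return { hits, scanned };
}

console.log(scanAll(indexKeys, /S/).scanned);     // 5 (the whole index)
console.log(scanPrefix(indexKeys, "US").scanned); // 1
```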

A better idea would be to determine which values for this field could match, and search for those explicitly:

> db.test.find({Country:{$in:['US']}}).explain();
> //                          ^ derive this list
{
"cursor" : "BtreeCursor Country_1 multi",
"isMultiKey" : false,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
    "Country" : [
        [
            "US",
            "US"
        ]
    ]
},
"server" : "host.local:27017"
}

Note that in this case, the value of nscanned is the same as nscannedObjects, indicating an efficient use of the index and likely a much faster query.
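The "derive this list" step can be done client-side. A sketch in plain JavaScript (the distinct country values here are hypothetical): apply the regex once to the small set of distinct values, then query with the exact matches via `$in`, so the query stays an equality lookup on the index.

```javascript
// Hypothetical result of db.test.distinct("Country") - a small, bounded set.
const distinctCountries = ["UK", "US", "Spain", "Sweden", "France"];

// Apply the regex once, client-side, to the distinct values...
const matching = distinctCountries.filter(c => /S/.test(c));

// ...then query with exact values, e.g. in the mongo shell:
//   db.test.find({ Country: { $in: matching } })
// Each $in value becomes a point range in the index, so nscanned
// stays close to the number of matching documents.
console.log(matching); // ["US", "Spain", "Sweden"]
```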

AD7six
  • To be specific, the caveat for the unanchored $regex is that it cannot use an index **efficiently**. This is an *index scan* not a *collection scan*. Scanning 150 million index entries is still less intensive than scanning 150 million documents (though certainly not ideal). – Stennie Jan 03 '14 at 22:17
  • @Stennie that caveat is certainly accurate but an unanchored query won't use an index _at all_ unless it's hinted. – AD7six Jan 03 '14 at 22:28
  • I believe you're mistaken. Have you tested this? I ensured an index on Country and `explain()` on your unanchored regex shows the index used: "BtreeCursor Country_1 multi". I checked in 2.0.9, 2.2.6, and 2.4.8 .. all report the same. `nscanned` (index entries) increases with the size of the collection but `nscannedobjects` does not. – Stennie Jan 03 '14 at 22:45
  • [You're right](https://gist.github.com/AD7six/82d32cedd2ba1c70ee85) which is awesome - I'm pretty sure that wasn't the case with earlier versions and [the docs](http://docs.mongodb.org/manual/core/query-optimization/) still state `Queries that specify regular expressions [...] cannot use an index with one exception. Queries that specify regular expression with anchors at the beginning of a string can use an index.`. In any case thanks for clarifying. – AD7six Jan 03 '14 at 23:23
  • Thanks for pointing at the docs reference; will get it clarified. FYI, the [`$regex` command](http://docs.mongodb.org/manual/reference/operator/query/regex/) documentation has a more accurate explanation. Out of curiosity I tried with MongoDB 1.8.5 (which is going back to March 2011) and it still has the same behaviour :). While it is helpful that an index is being used for these queries (at least more helpful than a collection scan with no index), a different data model approach would definitely be recommendable if someone wanted to run those sorts of queries frequently. – Stennie Jan 03 '14 at 23:41
  • Almost all of my mongo experience (I don't currently use mongo, though I did so extensively in the past) predates that, though I can't seem to get earlier versions to build to confirm at what point that started to happen. There are many similar answers on SO, some recent such as [this one](http://stackoverflow.com/a/17502177/761202); it would be useful to identify when the query optimizer started to use indexes with unanchored regexes and clarify that it's no longer a current (or even recent) behavior. I'll update the answer accordingly =). – AD7six Jan 04 '14 at 15:07