5

I have the following Mongodb database structure:

{ 
    "_id" : "519817e508a16b447c00020e", 
    "keyword" : "Just an example query", 
    "rankings" : 
    {
        results:
        {
            "1" : { "domain" : "example1.com", "href" : "http://www.example1.com/"},
            "2" : { "domain" : "example2.com", "href" : "http://www.example2.com/"},
            "3" : { "domain" : "example3.com", "href" : "http://www.example3.com/"},
            "4" : { "domain" : "example4.com", "href" : "http://www.example4.com/"},
            "5" : { "domain" : "example5.com", "href" : "http://www.example5.com/"},
            ...
            ...
            "99" : { "domain" : "example99.com", "href" : "http://www.example99.com/"}
            "100" : {"domain" : "example100.com", "href" : "http://www.example100.com/"}
        }, 
        "plus":"many", 
        "other":"not", 
        "interesting" : "stuff", 
        "for": "this question"
    }
}

In a previous question, I asked how to index the text so that I could search for the keyword and domain using for example:

db.ranking.find({ $text: { $search: "\"example9.com\" \"Just an example query\""}})  

The awesome answer by John Petrone was:

db.ranking.ensureIndex(
{
    "keyword": "text",
    "rankings.results.1.domain" : "text",
    "rankings.results.2.domain" : "text",
    ...
    ...
    "rankings.results.99.domain" : "text",
    "rankings.results.100.domain" : "text"
}

However, if that works just great when I have 10 results, I run into an "Index key pattern too large" error with code 67 from Mongo shell when I try to index 100 results.

So the big question is:

How (the hell) can I resolve that "index key pattern too large" error?


EDIT: 18/08/2014 The document structure clarified

{ 
    "_id" : "519817e508a16b447c00020e", #From Mongodb
    "keyword" : "Just an example query", 
    "date" : "2014-03-28"
    "rankings" :
    {
            "1" : { "domain" : "example1.com", "href" : "http://www.example1.com/", "plus" : "stuff1"},
            ...
            "100" : {"domain" : "example100.com", "href" : "http://www.example100.com/"plus" : "stuff100"}"}
    }, 
    "plus":"many", 
    "other":"not", 
    "interesting" : "stuff", 
    "for": "this question"
}
Community
  • 1
  • 1
Antoine Brunel
  • 1,065
  • 2
  • 14
  • 30
  • 1
    It ocurred to me when I saw the original question that this would quickly become a problem. Is there no way you can change the structure? – John Powell Aug 16 '14 at 06:52
  • Well, that's a json received from an external API. But, if I changed the structure, what would you recommend and how could I setup the index? – Antoine Brunel Aug 16 '14 at 10:13
  • 1
    Do you want to reference both domain and href in your queries? With equal weight? Presumably the _id comes from each set of searches, so you could have multiple documents each with a difference _id and an array of results, with href and domain inside? I just want to be clear, before offering an answer. – John Powell Aug 17 '14 at 10:59
  • @JohnBarça The _id comes from by default from Mongodb, not from the original json. Yes, I'd rather reference both keyword, href and domain because sometimes I may query the keyword, domain or the href! I was thinking in something like this: `"{keyword" : "Just an example query", "rankings" : [{"rank" : 1, "domain" : "example1.com", "href" : "http://www.example1.com/"}, … { "rank" : 99, "domain" : "example99.com", "href" : "http://www.example99.com/“}]} Thanks! – Antoine Brunel Aug 18 '14 at 15:45
  • 1
    What is the use case? I feel like a different data model is the best way to proceed here. – wdberkeley Aug 18 '14 at 15:52
  • I just clarified the document structure in an edit in the lower part of the post. @wdberkeley Use Cases 1: Search for keyword : "just an Example" with domain : "example1.com" Use Cases 2: Search for keyword : "just an Example" with domain : "example1.com" for rank_date : "2014-03-28" Use Cases 3: Search for keyword : "just an Example" with href : "http://www.example1.com/" Use Cases 4: Search for keyword : "just an Example" with href : "http://www.example1.com/" for rank_date : "2014-03-28" AND rank_date : "2014-03-21" Thanks! – Antoine Brunel Aug 18 '14 at 16:42
  • 1
    Exact matches as in your examples? Or keyword/text search matches? Text search on `domain` and `href` fields or just the `keyword` field? – wdberkeley Aug 18 '14 at 18:30
  • @wdberkeley Exact match for the domain, partial match for href and keyword. Thanks! – Antoine Brunel Aug 18 '14 at 19:22
  • But I'm also super interested in understanding what would be the best option according to the different case! – Antoine Brunel Aug 18 '14 at 23:43
  • I was actually thinking of spliting the structure into an array as described in the comment above, then index it with db.all.ensureIndex( {"rankings.href" : "text"} ), but @JohnBarça came up with an interesting idea about referencing instead of embedding! – Antoine Brunel Aug 19 '14 at 15:09

2 Answers2

1

The problem with your suggested structure:

{
 keyword" : "Just an example query", 
 "rankings" :
    [{"rank" : 1, "domain" : "example1.com", "href" : "example1.com"},
     ...{ "rank" : 99, "domain" : "example99.com", "href" : "example99.com“}
 ]}
}

Is that although you can now do

db.ranking.ensureIndex({"rankings.href":"text", "rankings.domain":"text"}) 

and then run queries like:

db.ranking.find({$text:{$search:"example1"}});

this will now return the whole array document where the array element is matched.

You might want to consider referencing so that each rankings result is a separate document and the keywords and other meta data are referenced, to avoid repeating lots of information.

So, you have a keyword/metadata document like:

{_id:1, "keyword":"example query", "querydate": date, "other stuff":"other meta data"},
{_id:2, "keyword":"example query 2", "querydate": date, "other stuff":"other meta data 2"}

and then a results document like:

{keyword_id:1, {"rank" : 1, "domain" : "example1.com", "href" : "example1.com"},
... keyword_id:1, {"rank" : 99, "domain" : "example99.com", "href" : "example99.com"},
 keyword_id:2, {"rank" : 1, "domain" : "example1.com", "href" : "example1.com"},
 ...keyword_id:2, {"rank" : 99, "domain" : "example99.com", "href" : "example99.com"}}

where keyword_id links back to (references) the keyword/metadata table -- obviously, in practice, the _ids will look like "_id" : "519817e508a16b447c00020e", but this is just for readability. You could now index on keyword_id, domain and href, either together or separately, depending on your query types and you will not get the index key pattern too large error and you will only get a single matching document rather than a whole array returned.

I am not entirely clear on where you are needing fuzzy/regex style searches and whether you will be searching metadata or just href and domain, but I think this structure should be a cleaner way to start thinking about indexing, without maxing out on indexes, as before. It will also allow you to combine finds on normal indexes with text indexes, depending on your query pattern.

You might find this answer MongoDB relationships: embed or reference? useful when considering you document struture.

Community
  • 1
  • 1
John Powell
  • 12,253
  • 6
  • 59
  • 67
  • Thanks, it's a very interesting answer (a pity I can't vote yet). The reference article you posted is great and I actually did not consider splitting. But why not an even flatter structure for results document, like: `{keyword_id:1, "rank" : 1, "domain" : "example1-1.com", "href" : "example1-1.com"} ... {keyword_id:1, "rank" : 99, "domain" : "example1-99.com", "href" : "example1-99.com"}, {keyword_id:2, "rank" : 1, "domain" : "example2-1.com", "href" : "example2-1.com"} ... {keyword_id:1, "rank" : 99, "domain" : "example2-99.com", "href" : "example2-99.com"}}` – Antoine Brunel Aug 19 '14 at 14:55
  • And this is a very interesting article too http://seanhess.github.io/2012/02/01/mongodb_relational.html – Antoine Brunel Aug 19 '14 at 15:06
  • Yes, you could go for a totally flat structure like that. I was thinking you might want to keep meta data separately around each set of requests to the api, but the flat structure will certainly work and be easy to index and query. – John Powell Aug 19 '14 at 16:41
  • Very good article. More or less what I was hinting at with the comment about querying arrays – John Powell Aug 19 '14 at 17:21
1

So, that's my solution: I decided to stick with the embedded document with an overly simple modification: Replacing dictionary keys containing the actual rank by an array containing the rank and that's it:

{ 
  "_id" : "519817e508a16b447c00020e", #From Mongodb
  "keyword" : "Just an example query", 
  "date" : "2014-03-28"
  "rankings" :
  [
    { 
      "domain" : "example1.com", "href" : "http://www.example1.com/", "plus" : "stuff1", "rank" : 1
    },
    ...
    {
      "domain" : "example100.com", "href" : "http://www.example100.com/"plus" : "stuff100", "rank" : 100
    }
  ]
  "plus":"many", 
  "more":"uninteresting", 
  "stuff" : "for", 
  "this": "question"
}

Then, I can select an entire document using for example:

> db.ranking.find({"keyword":"how are you doing", "rank_date" : "2014-08-27”)

Or a single result by using projections which is just awesome and a new feature in Mongodb 2.6 :-D

> db.collection.find({ "rank_date" : "2014-04-09", "rankings.href": "http://www.example100.com/" }, { "rankings.$": 1 })

  [
    { 
      "domain" : "example100.com", "href" : "http://www.example100.com/", "plus" : "stuff100", "rank" : 100
    },
  ]

And even get one single url rank directly:

> db.collection.find({"rank_date" : "2014-04-09", "rankings.href": "http://www.example5.com/"}, { "rankings.$": 1 })[0]['rankings'][0]['rank']
5

And finally, I'm also creating an index based on the url:

> db.collection.ensureIndex( {"rankings.href" : "text"} )

With the index, I can either search for a single url, a partial url, a subdomain or the entire domain so that's just great:

> db.collection.find({ $text: { $search: "example5.com"}})

And that's it really! Thanks a lot for everyone's help, especially @JohnBarça :-D

Antoine Brunel
  • 1,065
  • 2
  • 14
  • 30