
I have json document like this:

{
  "A": [
    {
      "C": "abc",
      "D": "de"
    },
    {
      "C": "fg",
      "D": "hi"
    }
  ]
}

I want to check whether "A" contains the string "ef" or not: first concatenate all the values into "abcdefghi", then search for "ef".

In XML, XPATH it would be something like: //A[contains(., 'ef')]

Is there any similar query in Mongodb?

Prakash Thapa
  • It would be extremely inefficient to do so both to construct the string, but also perform the search. Can you use a full text search engine instead? – WiredPrairie May 08 '14 at 19:00
  • Full text search cannot be used as the result will be different. I also feel that there is no easy way to find a solution to this; most important is that I don't know the depth, nor which children will be there. It is quite easy in the case of XPATH as in the example above. – Prakash Thapa May 10 '14 at 15:16
  • MongoDB isn't a good fit for this with your data structured as you've presented it. I don't know what you mean by "full text cannot be used as the result will be different." Have you tried a full text search engine? It would be far more suited to this task. – WiredPrairie May 10 '14 at 17:02

1 Answer


All options are pretty horrible for this type of search, but there are a few approaches you can take. Please note though that the final option here is likely the best solution; I present the others in order to illustrate the problem.

If the keys in the array "A" are consistently defined and "A" always contains an array, you would search like this:

db.collection.aggregate([
    // Filter the documents containing your parts
    { "$match": {
        "$and": [
            { "$or": [
                { "A.C": /e/ },
                { "A.D": /e/ } 
            ]},
            {"$or": [
                { "A.C": /f/ },
                { "A.D": /f/ }
            ]}
        ]
    }},

    // Keep the original form and a copy of the array
    { "$project": { 
        "_id": { 
            "_id": "$_id", 
            "A": "$A" 
        },
        "A": 1 
    }},

    // Unwind the array
    { "$unwind": "$A" },

    // Join the two fields and push to a single array
    { "$group": {
         "_id": "$_id",
         "joined": { "$push": {
             "$concat": [ "$A.C", "$A.D" ]
         }}
    }},

    // Copy the array
    { "$project": {
        "C": "$joined",
        "D": "$joined"
    }},

    // Unwind both arrays
    { "$unwind": "$C" },
    { "$unwind": "$D" },

    // Join the copies and test if they are the same
    { "$project": {
        "joined": { "$concat": [ "$C", "$D" ] },
        "same": { "$eq": [ "$C", "$D" ] }
    }},

    // Discard the "same" elements and search for the required string
    { "$match": {
        "same": false,
        "joined": { "$regex": "ef" }
    }},

    // Project the original form of the matching documents
    { "$project": {
        "_id": "$_id._id",
        "A": "$_id.A"
    }}
])

So apart from the horrible $regex matching, there are a few hoops to jump through in order to get the fields "joined" so that you can search for the string in sequence. Also note the reverse joining that happens here, which can produce a false positive. There is currently no simple way to avoid that reverse join or otherwise filter it out, so that is something to consider.
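To make the false-positive risk concrete, here is a plain JavaScript sketch (not part of the pipeline, and using variable names of my own) of what the cross-join stages above effectively compute for the sample document:

```javascript
// The $group/$concat stage pushes "abcde" and "fghi" (each element's
// "C" + "D"). The duplicated arrays and double $unwind then join every
// element with every other element, in both orders.
var joined = ["abcde", "fghi"];

var crossJoins = [];
joined.forEach(function (c) {
  joined.forEach(function (d) {
    if (c !== d) crossJoins.push(c + d); // the "same": false filter
  });
});

// "abcdefghi" is the intended document-order string:
console.log(crossJoins.indexOf("abcdefghi") !== -1); // true

// But the reverse join "fghiabcde" is also produced, so a search for
// "hia" would match even though it never occurs in document order:
console.log(crossJoins.some(function (s) { return /hia/.test(s); })); // true
```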

Another approach is to basically run everything through arbitrary JavaScript. The mapReduce method can be your vehicle for this. Here you can be a bit looser with the types of data that can be contained in "A" and try to tie in some more conditional matching to attempt to reduce the set of documents you are working on:

db.collection.mapReduce(
    function () {

      var joined = "";

      if ( Object.prototype.toString.call( this.A ) === '[object Array]' ) {
        this.A.forEach(function(doc) {
          for ( var k in doc ) {
            joined += doc[k];
          }
        });
      } else {
        joined = this.A;  // presuming this is just a string
      }

      var id = this._id;
      delete this["_id"];

      if ( joined.match(/ef/) )
        emit( id, this  );

    },
    function(){},    // will not reduce
    { 
        "query": {
            "$or": [
                { "A": /ef/ },
                { "$and": [
                    { "$or": [
                        { "A.C": /e/ },
                        { "A.D": /e/ } 
                    ]},
                    {"$or": [
                        { "A.C": /f/ },
                        { "A.D": /f/ }
                    ]}
                ] }
            ]
        },
        "out": { "inline": 1 }
    }
);

So you can use that with whatever arbitrary logic to search the contained objects. This one just differentiates between "arrays" and otherwise presumes a string, allowing the additional part of the query to first search for a matching "string" element as a "short circuit" evaluation.
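The comments raise the case where the depth of "A" and the names of its children are not known in advance. A possible adaptation of the map function's logic is a recursive join; the helper name and structure here are my own sketch, not from the question:

```javascript
// Hypothetical helper: recursively concatenate every string value in a
// document, regardless of nesting depth or key names. This could be
// inlined into the map function above in place of the flat forEach.
function joinValues(node) {
  if (typeof node === "string") return node;
  var joined = "";
  if (Object.prototype.toString.call(node) === "[object Array]") {
    node.forEach(function (item) {
      joined += joinValues(item);
    });
  } else if (node !== null && typeof node === "object") {
    for (var k in node) {
      if (k !== "_id") joined += joinValues(node[k]);
    }
  }
  return joined;
}

// With the sample document this yields "abcdefghi", so /ef/ matches:
var doc = { A: [ { C: "abc", D: "de" }, { C: "fg", D: "hi" } ] };
console.log(joinValues(doc.A));            // "abcdefghi"
console.log(/ef/.test(joinValues(doc.A))); // true
```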

But really at the end of the day, the best approach is to simply have the data present in your document, and you would have to maintain this yourself as you update the document contents:

{
  "A": [
    {
      "C": "abc",
      "D": "de"
    },
    {
      "C": "fg",
      "D": "hi"
    }
  ],
  "search": "abcdefghi"
}

So that still invokes a horrible usage of $regex type queries, but at least it avoids (or rather, shifts to the time the document is written) the overhead of "joining" the elements in order to search for your desired string.
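A minimal sketch of maintaining that field yourself, assuming the same flat structure as the sample document (the helper name is hypothetical, and you would call it in your application code on every write):

```javascript
// Build the concatenated "search" value from the "A" array at write
// time, so the stored document always carries the pre-joined string.
function buildSearchField(a) {
  var joined = "";
  a.forEach(function (doc) {
    for (var k in doc) {
      joined += doc[k];
    }
  });
  return joined;
}

var doc = { A: [ { C: "abc", D: "de" }, { C: "fg", D: "hi" } ] };
doc.search = buildSearchField(doc.A);
console.log(doc.search); // "abcdefghi"

// The query then becomes a single (still unanchored, so not
// index-friendly) regex match in the shell:
//   db.collection.find({ "search": /ef/ })
```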

Where this eventually leads is that a "full blown" text search solution, which at this time means an external one as opposed to the text search facilities in MongoDB, is probably going to be your best performing option.

Either use the "pre-stored" approach of creating your "joined" field, or, where supported (Solr is one solution that can do this), have a "computed field" in the text index that is created when indexing document content.

At any rate, those are the approaches and the general shape of the problem. This is not XPath searching, nor is there some "XPath like" view of an entire collection in this sense, so you are best suited to structuring your data towards the methods that are going to give you the best performance.

With all of that said, your sample here is a fairly contrived example, and if you had an actual use case for something "like" this, then that actual case may make a very interesting question indeed. Actual cases generally have different solutions than the contrived ones. But now you have something to consider.

Neil Lunn
  • I am very appreciated for your answered. The most important thing is that, I don't know the depth of `A` neither name of children are regular. I am trying get all (only)values like in xml, but I knew that there is no easy way. This is the question I posted before: http://stackoverflow.com/questions/23240961/how-to-use-full-text-search-for-unknown-number-of-children-of-a-field-in-mongodb – Prakash Thapa May 10 '14 at 15:28
  • @PThapa No one who has pointed to a text search engine here is talking about MongoDB text search. Have a look at external text search solutions as has actually been suggested. You can adapt mapReduce processes to do this sort of operation at an unkown depth with some adaption of recursion, a basic concept is [here](http://stackoverflow.com/questions/23412243/how-to-query-a-relative-element-using-mongodb/23419449#23419449). This is not how you presented your question nor does it change the basic answer. "Pre-concatenate" your data or use an external source to facilitate the search. – Neil Lunn May 11 '14 at 00:38