Extracting a list of substrings from MongoDB using a Regular Expression

Question

I need to extract a part of a string that matches a regex and return it.

I have a set of documents such as:

{"_id" : 12121, "fileName" : "apple.doc"},
{"_id" : 12125, "fileName" : "rap.txt"},
{"_id" : 12126, "fileName" : "tap.pdf"},
{"_id" : 12127, "fileName" : "cricket.txt"}

I need to extract all file extensions and return {".doc", ".txt", ".pdf"}.

I am trying to use the $regex operator to find the substrings and aggregate on the results, but I am unable to extract the matched part and pass it down the pipeline.

I have tried something like this without success:

aggregate([
  { $match: { "name": { $regex: '/\.[0-9a-z]+$/i', "$options": "i" } } },
  { $group: { _id: null, tot: { $push: "$name" } } }
])

And what have you tried? What problems have you encountered? — Phu Ngo, Sep 01 '16 at 13:17
Thanks guys. I have updated my question. I am using MongoChef to write the query. The problem is how do I extract the extension ? Finding names that contain the substring is not a problem — Macky, Sep 01 '16 at 13:19
Check again the [docs](https://docs.mongodb.com/manual/reference/operator/query/regex/#op._S_regex) for the syntax. Note specifically the `'`, `/` and `$options` — Phu Ngo, Sep 01 '16 at 13:58

score 3 · Answer 1 · answered Sep 01 '16 at 13:40

It is close to impossible to do this in the aggregation pipeline. You want to project your matches and include only the part after the period, but there is not yet an operator to locate the position of the period. You need that position because $substr (https://docs.mongodb.com/manual/reference/operator/aggregation/substr/) requires a start position. In addition, $regex is only for matching; you cannot use it in a projection to replace.

I think for now it is easier to do it in code. There you could use a regex replace or any other solution provided by your language.
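
As a rough illustration of the "do it in code" route, here is a minimal JavaScript sketch. It assumes the documents have already been fetched into an array; the helper names getExtension and distinctExtensions are hypothetical, and the pattern is the one from the question.

```javascript
// Hypothetical helper: returns the extension (including the dot),
// or null when the file name has no extension.
function getExtension(fileName) {
  const m = /\.[0-9a-z]+$/i.exec(fileName);
  return m ? m[0] : null;
}

// Hypothetical helper: collects the distinct extensions of
// documents that were already loaded from the collection.
function distinctExtensions(docs) {
  const seen = new Set();
  for (const doc of docs) {
    const ext = getExtension(doc.fileName);
    if (ext !== null) seen.add(ext);
  }
  return [...seen];
}

const docs = [
  { _id: 12121, fileName: "apple.doc" },
  { _id: 12125, fileName: "rap.txt" },
  { _id: 12126, fileName: "tap.pdf" },
  { _id: 12127, fileName: "cricket.txt" }
];

console.log(distinctExtensions(docs)); // [ '.doc', '.txt', '.pdf' ]
```

This still reads every document on the client, so it only makes sense when that cost is acceptable.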

HoefMeistert

I believe you understand my problem. Would you have a Java example available ? – Macky Sep 01 '16 at 13:43
You can do something like : var anyString4 = anyString.substring(anyString.length - 4); See here for more examples: https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/String/substring You can also look for the position of the last period and use that as the start position. – HoefMeistert Sep 01 '16 at 13:46
OK thanks. I think I can get that part in the Java code but want to do it in the database as there are a million records and I can't run through them every time I need to get the extensions – Macky Sep 01 '16 at 13:51
Then you should update your documents to include the extension and set it on insert. You should be able to do it automatically (a class property with a substr of your current filename property). Then create a small script/program to update the current documents in the db. Make a simple aggregation with a group on extension and you're good to go :) – HoefMeistert Sep 01 '16 at 13:55
Thanks for the suggestions. I'll give it a think :-) I am still hopeful that I can extract the extension in the database layer. As you said, I need to find the location of the dot to pass it to the substring. – Macky Sep 01 '16 at 14:13
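
The "store the extension on the document" suggestion from the comments could be sketched roughly as follows. This is only an illustration: the field name extension and the helper buildExtensionUpdates are assumptions, and the function just builds update operations in the shape that bulkWrite accepts, without touching a database.

```javascript
// Hypothetical helper: builds one updateOne operation per document
// that has an extension. The target field name "extension" is assumed.
function buildExtensionUpdates(docs) {
  const ops = [];
  for (const doc of docs) {
    const dot = doc.fileName.lastIndexOf(".");
    if (dot === -1) continue; // no period, nothing to store
    ops.push({
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { extension: doc.fileName.slice(dot) } }
      }
    });
  }
  return ops;
}

const sample = [
  { _id: 12121, fileName: "apple.doc" },
  { _id: 12129, fileName: "oops" }
];

console.log(JSON.stringify(buildExtensionUpdates(sample)));
```

The resulting array could then be passed to db.collection.bulkWrite(...), after which db.collection.distinct("extension") would return the list without scanning file names.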

Xavier Guihot · Answer 2 · 2020-01-21T22:10:04.257

Starting in MongoDB 4.2, the $regexFind aggregation operator makes things easier:

// { _id : 12121, fileName: "apple.doc" }
// { _id : 12125, fileName: "rap.txt" }
// { _id : 12126, fileName: "tap.pdf" }
// { _id : 12127, fileName: "cricket.txt" }
// { _id : 12129, fileName: "oops" }
db.collection.aggregate([
  { $set: { ext: { $regexFind: { input: "$fileName", regex: /\.\w+$/ } } } },
  { $group: { _id: null, extensions: { $addToSet: "$ext.match" } } }
])
// { _id: null, extensions: [ ".doc", ".pdf", ".txt" ] }

This makes use of:

The $set operator, which adds a new field to each of the documents.
This new field (ext) is the result of the $regexFind operator, which captures the result of a matching regex. If a match is found, it returns a document that contains information on the first match. If a match is not found, it returns null. For instance:
- For { fileName: "tap.pdf" }, it produces { ext: { match: ".pdf", idx: 3, captures: [] } }.
- For { fileName: "oops" }, it produces { ext: null }.
Finally, using a $group stage, coupled with $addToSet on the match subfield, we can generate the list of distinct extensions.
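
To make the shape of the $regexFind result more concrete, here is a small plain-JavaScript emulation. It is only a sketch for illustration, not how MongoDB implements the operator: the function regexFind is a hypothetical stand-in, and its idx is a UTF-16 index, which matches MongoDB's code point index only for simple ASCII names like these.

```javascript
// Hypothetical stand-in for $regexFind: returns { match, idx, captures }
// for the first match, or null when nothing matches.
function regexFind(input, regex) {
  const m = regex.exec(input);
  return m === null
    ? null
    : { match: m[0], idx: m.index, captures: m.slice(1) };
}

const fileNames = ["apple.doc", "rap.txt", "tap.pdf", "cricket.txt", "oops"];

// Equivalent of the $set stage: attach the result as "ext".
const results = fileNames.map((fileName) => ({
  fileName,
  ext: regexFind(fileName, /\.\w+$/)
}));

// Equivalent of $group with $addToSet on "ext.match":
// documents whose ext is null contribute nothing.
const extensions = [
  ...new Set(results.filter((r) => r.ext !== null).map((r) => r.ext.match))
];

console.log(results[2].ext); // { match: '.pdf', idx: 3, captures: [] }
console.log(results[4].ext); // null
console.log(extensions);     // [ '.doc', '.txt', '.pdf' ]
```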

score 2 · Accepted Answer · edited Feb 21 '23 at 10:35

It will be possible to do this in the upcoming version of MongoDB (as of this writing) using the aggregation framework and the $indexOfCP operator. Until then, your best bet here is MapReduce.

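var mapper = function() {
    // Emit everything from the first period onward, keyed by _id.
    emit(this._id, this.fileName.substring(this.fileName.indexOf(".")));
};

db.coll.mapReduce(mapper,
                  // Each _id is emitted only once, so this reducer is never called.
                  function(key, values) {},
                  { "out": { "inline": 1 }}
)["results"]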

Which yields:

[
    {
        "_id" : 12121,
        "value" : ".doc"
    },
    {
        "_id" : 12125,
        "value" : ".txt"
    },
    {
        "_id" : 12126,
        "value" : ".pdf"
    },
    {
        "_id" : 12127,
        "value" : ".txt"
    }
]
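
One caveat worth illustrating: indexOf(".") locates the first period, so names with several periods or with none behave differently from the sample data. The plain-JavaScript sketch below compares that approach with a lastIndexOf variant; both function names are hypothetical and the variant is a suggestion, not part of the original answer.

```javascript
// Same logic as the mapper: substring from the first period.
function extFromFirstDot(name) {
  return name.substring(name.indexOf("."));
}

// Assumed alternative: substring from the last period, null if none.
function extFromLastDot(name) {
  const dot = name.lastIndexOf(".");
  return dot === -1 ? null : name.substring(dot);
}

console.log(extFromFirstDot("archive.tar.gz")); // .tar.gz
console.log(extFromLastDot("archive.tar.gz"));  // .gz

// With no period, indexOf returns -1 and substring(-1) is treated
// as substring(0), so the whole name comes back.
console.log(extFromFirstDot("oops")); // oops
console.log(extFromLastDot("oops"));  // null
```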

For completeness, here is the solution using the aggregation framework*:

db.coll.aggregate(
    [
        { "$match": { "fileName": /\.[0-9a-z]+$/i } },
        { "$group": {
            "_id": null,
            "extensions": {
                "$push": {
                    "$substr": [
                        "$fileName",
                        { "$indexOfCP": [ "$fileName", "." ] },
                        -1
                    ]
                }
            }
        }}
    ])

which produces:

{ 
    "_id" : null, 
    "extensions" : [ ".doc", ".txt", ".pdf", ".txt" ] 
}

* Current development version of MongoDB (as of this writing).
