How to check if a portion of an _id from one collection appears in another

Question

I have a collection where the _id is of the form [message_code]-[language_code] and another where the _id is just [message_code]. What I'd like to do is find all documents from the first collection where the message_code portion of the _id does not appear in the second collection.

Example:

> db.colA.find({})
{ "_id" : "TRM1-EN" }
{ "_id" : "TRM1-ES" }
{ "_id" : "TRM2-EN" }
{ "_id" : "TRM2-ES" }
> db.colB.find({})
{ "_id" : "TRM1" }

I want a query that will return TRM2-EN and TRM-ES from colA. Of course in my live data, there are thousands of records in each collection.

According to this question which is trying to do something similar, we have to save the results from a query against colB and use it in an $in condition in a query against colA. In my case, I need to strip the -[language_code] portion before doing this comparison, but I can't find a way to do so.

If all else fails, I'll just create a new field in colA that contains only the message code, but is there a better way do it?

Edit: Based on Michael's answer, I was able to come up with this solution:

var arr = db.colB.distinct("_id")
var regexs = arr.map(function(elm){
   return new RegExp(elm);
})
var result = db.colA.find({_id : {$nin : regexs}}, {_id : true})

Edit: Upon closer inspection, the above method doesn't work after all. In the end, I just had to add the new field.

What doesn't work? work fine for me. I can add a demo if needed. — styvane, Apr 15 '15 at 06:13
Your method works on the example I posted, but my live data has so many message codes that some of them happen to be sub-strings of others, resulting in too many records being selected. Of course, there's no way to know this just from looking at my original question, so I'll leave your answer accepted and upvoted. — Nate Allen, Apr 15 '15 at 14:24
Ok maybe you should ask another question with sample document if you still have a problem — styvane, Apr 15 '15 at 14:25

score 2 · Accepted Answer · answered Apr 14 '15 at 17:42

2

Disclaimer: This is a little hack it may not end well.

Get distinct _id using collection.distinct method.
Build a regular expression array using Array.prototype.map()

var arr = db.colB.distinct('_id');
arr.map(function(elm, inx, tab) {
    tab[inx] = new RegExp(elm); 
});

db.colA.find({ '_id': { '$nin': arr }})

answered Apr 14 '15 at 17:42

styvane

59,869
19
150
156

2

I wasn't able to get your code to work as-is, but I did get a query that will work. I'll update my question with it. Thanks for pointing me in the right direction! – Nate Allen Apr 14 '15 at 18:23

score 1 · Answer 2 · answered Apr 14 '15 at 16:44

1

I'd add a new field to colA since you can index it and if you have hundreds of thousands of documents in each collection splitting the strings will be painfully slow.

But if you don't want to do that you could make use of the aggregation framework's $substr operator to extract the [message-code] then do a $match on the result.

answered Apr 14 '15 at 16:44

Pio

4,044
11
46
81

I wasn't aware of the aggregation framework, so thanks for pointing that out to me. Unfortunately, the $substr didn't work for me since it requires start and length indexes and not all of my ids are the same length. – Nate Allen Apr 14 '15 at 18:37

How to check if a portion of an _id from one collection appears in another

2 Answers2