How to match strings in MongoDb and ignore any whitespace

Question

Is it possible to ignore all whitespace using regex in MongoDB queries?

My Node.js program uses Cheerio to pull data from a number of websites, parses and then stores the data in MongoDB. My database has a People collection that keys on the string field Name.

Problem occurs where one website (site-A) shows the name HTML text as John&npsp;Smith, whereas another website (site-B) shows name as John Smith. My program has two scripts, one that scrapes site-A and another to scrape site-B; both of which use the following to scrape the Name data -

var $ = cheerio.load(htmlrow);
var personobj = { name: $('td.person a').text().trim() }

Each script then uses the following MongoDb command (using the native driver) to upsert the scraped data, keying on the Name field. However, this results in two records in the People collection -

db.collection('people').update(
    { Name: personobj.name },
    { $set: { LastScan: new Date() }},
    { upsert: true },
    function(){} );

Now, I tried using the regex "extended" 'x' option to query in MongoDb, but it's not working. In fact, I tried testing the 'x' option via the find operator in Robomongo, and it returns zero records. I also note that when find testing in Robomongo, and I simply type Name: "John Smith", it only returns the site-B record, the one without the $nbsp; whitespace; even though when I view the detail of both records, the name strings appear identical. (I suppose difference is caused somewhere by all the encoding/decoding going on here to scrape, parse, store, retrieve... but I'm not sure where or why).

Is it possible to ignore all whitespace when querying MongoDb using regex?

Or, is it easier to handle this in my javascript parse line, to somehow replace and 'standardize' all possible whitespace characters? (Any recommended library to do so?)

I'd definitely recommend you normalize the strings and avoid trying to use regex if possible. Unless the search is anchored (strings "start with") MongoDB won't be able to use an index. I wouldn't expect too many true whitespace characters to be used in a name that was meant for display. — WiredPrairie, Jan 13 '14 at 01:23
@WiredPrairie Agreed. Much easier to handle in the parsing logic -- useful such solution found [here](http://stackoverflow.com/a/1496863/549155) — mjmoody383, Jan 13 '14 at 05:33

How to match strings in MongoDb and ignore any whitespace

0 Answers0