Is it possible to ignore all whitespace using regex in MongoDB queries?
My Node.js program uses Cheerio to pull data from a number of websites, parses and then stores the data in MongoDB. My database has a People
collection that keys on the string field Name
.
Problem occurs where one website (site-A) shows the name HTML text as John&npsp;Smith
, whereas another website (site-B) shows name as John Smith
. My program has two scripts, one that scrapes site-A and another to scrape site-B; both of which use the following to scrape the Name
data -
var $ = cheerio.load(htmlrow);
var personobj = { name: $('td.person a').text().trim() }
Each script then uses the following MongoDb command (using the native driver) to upsert the scraped data, keying on the Name
field. However, this results in two records in the People
collection -
db.collection('people').update(
{ Name: personobj.name },
{ $set: { LastScan: new Date() }},
{ upsert: true },
function(){} );
Now, I tried using the regex "extended" 'x' option to query in MongoDb, but it's not working. In fact, I tried testing the 'x' option via the find
operator in Robomongo, and it returns zero records. I also note that when find
testing in Robomongo, and I simply type Name: "John Smith"
, it only returns the site-B record, the one without the $nbsp; whitespace; even though when I view the detail of both records, the name strings appear identical. (I suppose difference is caused somewhere by all the encoding/decoding going on here to scrape, parse, store, retrieve... but I'm not sure where or why).
Is it possible to ignore all whitespace when querying MongoDb using regex?
Or, is it easier to handle this in my javascript parse line, to somehow replace and 'standardize' all possible whitespace characters? (Any recommended library to do so?)