I'm developing a NodeJS program that uses the request and cheerio packages to do some scraping for a research project. Part of the scraped data is news article titles. In some of these titles, extended special characters (like —, a big dash) come back as ?—? even though the webpage shows only the dash. The question marks are present both in the raw HTML response and in the cheerio object. This is how I'm fetching the pages and loading them into cheerio:
const request = require('request');
const cheerio = require('cheerio');

// Fetches a page and resolves with the cheerio object for its HTML.
function aRequest(url){
    return new Promise((res, rej)=>{
        request({
            url: url,
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'
            }
        }, (err, resp, html)=>{
            if(!err){
                res(cheerio.load(html));
            } else {
                rej(err);
            }
        });
    });
}
These question marks surrounding the special character do not exist in the original title, so I'm attempting to remove them (in the process I end up removing the big dash too, although that isn't really a problem). Most of the solutions I've tried don't work. Here are some of the methods I've tried, including answers listed in the following SO questions:
Remove all special characters with regexp
The answer from that special-character removal question removes the dash, but the question marks remain. Some snippets of things I've tried that do not work:
.replace("?—?", " — ");
.replace(/[^\w\s]/gi, " — ");
.replace("?", "");
.replace(/[?]/gi, " ");
.replace("�", ""); // ASCII question mark
// this is the point I started getting desperate to just have it work
.replace(/[^\w\s]/gi, "").replace("??", " — ");
I figure I could probably get the index of where the — occurs and remove the characters one index to its left and right, although that seems like a last resort.
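The last-resort idea above could be sketched roughly as follows. This is a minimal sketch, not a confirmed fix: normaliseDash is a hypothetical helper name, and it assumes every em dash in a title is flanked by exactly one unwanted character on each side, whatever that character turns out to be.

```javascript
// Hypothetical helper: replaces "<any char>—<any char>" with " — ".
// Assumption: each em dash has exactly one unwanted character on either side.
function normaliseDash(title) {
    // "." matches any single character except a line terminator;
    // \u2014 is the em dash; the "u" flag makes "." match whole code points.
    return title.replace(/.\u2014./gu, ' \u2014 ');
}
```

A dash at the very start or end of a title has no neighbour on one side, so it would be left alone. A dash written directly between two letters would lose those letters, which is why this stays a last resort.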
Furthermore, removing even regular question marks from the strings doesn't seem to work. For example, if I have a title of "This is a title?", none of these replace operations on question marks (such as .replace(/[?]/gi, "");) removes that question mark either.
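One thing worth ruling out for ordinary question marks (an assumption about a possible cause, not something confirmed here): replace never modifies the original string, it returns a new one, and a plain string pattern only replaces the first match. The two function names below are hypothetical and exist only for illustration.

```javascript
// Hypothetical helpers illustrating standard String.prototype.replace behaviour.

// A string pattern replaces only the first occurrence.
function stripFirstQuestionMark(s) {
    return s.replace('?', '');
}

// A regex with the "g" flag replaces every occurrence.
function stripAllQuestionMarks(s) {
    return s.replace(/\?/g, '');
}

// The result has to be captured; the input string itself is left unchanged.
const sample = 'Is this a title? Really?';
const cleaned = stripAllQuestionMarks(sample);
```

If the return value is being assigned and a real "?" still survives, then the character is probably not U+003F at all.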
Am I missing something here? I have a feeling the question mark is some kind of non-English character instead of an actual question mark, although I'm not sure what it would be.
How can I remove the ?—? and just replace it with —?
My Node version is v10.15.0, and I'm using the latest versions of cheerio and request available from npm.
EDIT: I've since found this question, which experienced a similar problem. I tried removing the characters by character code 57399 (which is what that person experienced), but it still did not remove them. Will attempt to identify the char code of the question marks.
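A minimal sketch of how the char codes could be identified: list the code point of every character in a scraped title. describeChars is a hypothetical helper name, and the sketch makes no assumption about which code points will actually show up.

```javascript
// Hypothetical helper: returns the code point of each character as "U+XXXX".
function describeChars(text) {
    // Array.from iterates by code point, so surrogate pairs stay together.
    return Array.from(text, (ch) => {
        const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
        return 'U+' + hex;
    });
}
```

Running this on the text around the dash should show whether the "question marks" are U+003F or some other character that the console merely displays as ?.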