I want to get the content of the canonical link from a page. The code runs in Node.js on the server (without a DOM). I have the complete response body (the downloaded page) and the following code:
var metaRegex = /<link.*?href=['"](.*?)['"].*?rel=['"]canonical['"].*?>/i;
// correctly returns: https://support.google.com/recaptcha/?hl=en
// var metaRegex = /<link(?=.*rel=['"]canonical['"])(?=.*href=['"](.*?)['"]).*?>/i;
// incorrectly returns: https://www.google.com/accounts/TOS
var metaTag = metaRegex.exec(body);
console.log(metaTag[1]);
The problem with the first expression is the order of the rel and href attributes. It matches only:
<link href="https://support.google.com/recaptcha/?hl=en" rel="canonical">
and NOT
<link rel="canonical" href="https://support.google.com/recaptcha/?hl=en">
The second expression accepts both orderings, but it captures the last occurrence of href instead of the one inside the canonical tag.
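To reproduce the second case, here is a minimal sketch. The `sample` string is a made-up one-line simplification, not the real page. As far as I can tell, the greedy `.*` inside the lookahead is not confined to the `<link>` tag, so it runs to the end of the line and backtracks to the last `href=` it finds there:

```javascript
// Hypothetical one-line sample: a canonical link followed by an unrelated anchor.
var sample = '<link rel="canonical" href="https://support.google.com/recaptcha/?hl=en">' +
             '<a href="https://www.google.com/accounts/TOS">Terms</a>';

// The second (lookahead) expression from the question, unchanged.
var lookaheadRegex = /<link(?=.*rel=['"]canonical['"])(?=.*href=['"](.*?)['"]).*?>/i;

var lookaheadMatch = lookaheadRegex.exec(sample);

// The capture comes from the anchor tag, not from the <link> tag.
console.log(lookaheadMatch[1]); // https://www.google.com/accounts/TOS
```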
It looks as if I need to require that both attributes exist within the same tag, and perhaps group them?
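One idea I am considering is to split the work into two steps: first isolate each `<link ...>` tag with `[^>]*` (which cannot cross a `>`), then test that tag for `rel="canonical"` and read its `href`, so attribute order no longer matters. This is only a sketch. `findCanonicalHref` is a hypothetical helper name, and it assumes quoted attribute values with no `>` or quote characters inside them, so it is not a substitute for a real HTML parser:

```javascript
// Sketch only: findCanonicalHref is a hypothetical helper, not a library function.
function findCanonicalHref(html) {
  // [^>]* cannot cross a '>', so each match stays inside a single <link ...> tag.
  var linkTagRegex = /<link\b[^>]*>/gi;
  // Assumes rel is exactly "canonical" and values are quoted.
  var relRegex = /\brel\s*=\s*['"]canonical['"]/i;
  var hrefRegex = /\bhref\s*=\s*['"]([^'"]*)['"]/i;

  var tags = html.match(linkTagRegex) || [];
  for (var i = 0; i < tags.length; i++) {
    if (relRegex.test(tags[i])) {
      var href = hrefRegex.exec(tags[i]);
      if (href) {
        return href[1];
      }
    }
  }
  return null; // no canonical link found
}

// Both attribute orders, each followed by an unrelated href on the same line.
var tail = '<a href="https://www.google.com/accounts/TOS">Terms</a>';
var hrefFirst = '<link href="https://support.google.com/recaptcha/?hl=en" rel="canonical">' + tail;
var relFirst = '<link rel="canonical" href="https://support.google.com/recaptcha/?hl=en">' + tail;

console.log(findCanonicalHref(hrefFirst));
console.log(findCanonicalHref(relFirst));
```

With these two sample strings, both calls print https://support.google.com/recaptcha/?hl=en, and the function returns null when no canonical link is present.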
What is the correct way to do this?