
I want to get the content of the canonical link from a page. The code runs in Node.js on the server (without a DOM). I have the complete response body (the downloaded page) and the following code:

var metaRegex = new RegExp(/<link.*?href=['"](.*?)['"].*?rel=['"]canonical['"].*?>/i);
// returns the correct URL: https://support.google.com/recaptcha/?hl=en
// var metaRegex = new RegExp(/<link(?=.*rel=['"]canonical['"])(?=.*href=['"](.*?)['"]).*?>/i);
// returns the wrong URL: https://www.google.com/accounts/TOS
var metaTag = metaRegex.exec(body);
console.log(metaTag[1]);

JsFiddle.

The problem with the first expression is the order of the rel and href attributes. It matches only:

<link href="https://support.google.com/recaptcha/?hl=en" rel="canonical">

and NOT

<link rel="canonical" href="https://support.google.com/recaptcha/?hl=en">

The second expression accepts both orderings, but it matches the last occurrence of href.

It looks like I should require both attributes to be present and perhaps group them?

What is the correct way?

MakoBuk
  • The correct way is to not use regex on HTML. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Adrian Wragg Oct 27 '16 at 12:05
  • More usefully, use an HTML parser instead. http://stackoverflow.com/questions/7977945/html-parser-on-node-js – Adrian Wragg Oct 27 '16 at 12:06
  • @AdrianWragg I don't agree with you. Regex is useful for any kind of string parsing. My question is not how to do this in a different way; I asked how to retrieve the correct part of the string. A DOM parser is too slow for my case. – MakoBuk Oct 27 '16 at 12:08
  • Actually, no, regex is not "useful for parsing". It can't parse any but the simplest of grammars. It is "useful for **matching**". –  Oct 27 '16 at 12:55
  • @torazaburo That's just a quibble. – MakoBuk Oct 27 '16 at 13:01
  • Actually, it's far more than a quibble. There is an entire body of knowledge about languages and grammars and parsing them, and it is an indisputable fact that regexp is not suited to parsing languages, of which HTML is very definitely one. –  Oct 27 '16 at 15:36

1 Answer

Just use two sequential regular expressions, like this:

var body = '<link rel="stylesheet" href="my.css"/> <link href="https://support.google.com/recaptcha/?hl=en" rel="canonical"/> <a href="https://www.google.com/accounts/TOS"/>'
var linkRegexp = /(<link[^>]*rel=['"]canonical['"][^>]*>)/;
var hrefRegexp = /href=['"](.*?)['"]/;

var linkBody = linkRegexp.exec(body)[1];
console.log(hrefRegexp.exec(linkBody)[1]);
  • linkRegexp - gets the link tag with rel='canonical'
  • hrefRegexp - extracts the href from it
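
The two steps above can be wrapped in a small helper that returns null instead of throwing when the page has no canonical link. This is only a sketch: the name getCanonicalHref is made up for illustration, the i flag is added to tolerate upper-case tags, and it assumes attribute values are quoted and contain no > character.

```javascript
// Hypothetical helper (not part of the original answer): the same two-step
// extraction, with null checks so a page without a canonical link does not throw.
function getCanonicalHref(html) {
  // Step 1: find the <link> tag whose rel attribute is "canonical".
  var linkMatch = /<link\b[^>]*\brel=['"]canonical['"][^>]*>/i.exec(html);
  if (!linkMatch) return null;
  // Step 2: pull the href value out of that single tag only.
  var hrefMatch = /\bhref=['"]([^'"]*)['"]/i.exec(linkMatch[0]);
  return hrefMatch ? hrefMatch[1] : null;
}

var sample = '<link rel="stylesheet" href="my.css"/> ' +
  '<link rel="canonical" href="https://support.google.com/recaptcha/?hl=en"/> ' +
  '<a href="https://www.google.com/accounts/TOS"/>';

console.log(getCanonicalHref(sample));
console.log(getCanonicalHref('<link rel="stylesheet" href="my.css"/>'));
```

Because the href is searched for only inside the matched tag, the attribute order does not matter.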

If you want just one regexp, you can use alternative groups and pick the non-empty match, like this:

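// One alternative per attribute order; only one of the two capture groups is filled.
var regexp = /<link[^>]*(?=href=['"]([^'"]*)['"][^>]*?rel=['"]canonical['"]|rel=['"]canonical['"][^>]*?href=['"]([^'"]*)['"])[^>]*>/;
// slice(1) keeps just the capture groups; the unmatched one is undefined and joins as "".
console.log(regexp.exec(body).slice(1).join(""));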

(but IMHO this is much less readable)
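
A different single-expression variant, offered here as a sketch rather than as part of the original answer, keeps the two lookaheads from the question but replaces .* with [^>]*. In the question's second expression, .* is greedy and is not confined to the tag, so the href lookahead backtracks from the end of the line and lands on the last href it can find. Bounding both lookaheads with [^>]* keeps them inside the current tag. The variable names greedy and bounded are illustrative, and the sketch assumes quoted attribute values with no > inside them.

```javascript
var body = '<link rel="stylesheet" href="my.css"/> ' +
  '<link rel="canonical" href="https://support.google.com/recaptcha/?hl=en"/> ' +
  '<a href="https://www.google.com/accounts/TOS"/>';

// The second expression from the question: .* can run past the end of the tag.
var greedy = /<link(?=.*rel=['"]canonical['"])(?=.*href=['"](.*?)['"]).*?>/i;

// Same idea, but [^>]* stops each lookahead at the closing > of the tag.
var bounded = /<link(?=[^>]*rel=['"]canonical['"])(?=[^>]*href=['"]([^'"]*)['"])[^>]*>/i;

console.log(greedy.exec(body)[1]);  // https://www.google.com/accounts/TOS
console.log(bounded.exec(body)[1]); // https://support.google.com/recaptcha/?hl=en
```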

zeppelin