How to build a regex to analyze all links on a web page?

Question

I am building a web crawler in PHP, meant for Intranet use (we're dealing with a huge Intranet). I managed to download a web page using the cURL functions, but now I want to scan the content for links. I am trying to find all obvious links and split them into their corresponding scheme/authority/path/query/fragment so I can index them properly.

Is there a known regular expression that matches all the links, including ones like <img src="../images/header/logo.png" />, background-image: url(..) and <a href="?query#lonely-fragment">?

What are all the plain-text link representations that I can find using regular expressions in PHP?

Do not parse HTML with regexes. Use an XML parser, such as DOMDocument. — Vincent Savard, Nov 12 '10 at 18:15
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — jwueller, Nov 12 '10 at 18:21
*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Nov 12 '10 at 19:12

score 3 · Accepted Answer · answered Nov 12 '10 at 18:23

You will be better off parsing documents using a proper HTML parser. Regex is not really suited for this kind of thing.

Once the document is loaded into a parser, it's fairly simple to use XPath queries such as //img/@src or //a/@href to find all of the content links in the document itself.
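As a rough sketch of the XPath approach, the snippet below loads markup with DOMDocument, pulls out the src and href attributes with DOMXPath, and splits each link with parse_url. The sample markup and the intranet.example host are made up for illustration; in the crawler, $html would be the body returned by cURL. Note that parse_url reports host, port, user and pass separately rather than a single "authority" component, and that relative links would still need to be resolved against the page URL.

```php
<?php
// Hypothetical sample markup standing in for the page fetched with cURL.
$html = <<<'HTML'
<html><body>
<img src="../images/header/logo.png" />
<a href="?query#lonely-fragment">link</a>
<a href="http://intranet.example/docs/index.html?id=7#top">docs</a>
</body></html>
HTML;

$doc = new DOMDocument();
// Real-world HTML is often malformed; buffer parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Collect the attribute values in document order.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//img/@src | //a/@href') as $attr) {
    $links[] = $attr->nodeValue;
}

// Split each link into whatever components it actually contains.
$parts = array_map('parse_url', $links);
print_r($parts);
```

Other link-carrying attributes (for example //script/@src or //link/@href) can be added to the same XPath union.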

If you want to scan CSS, you will also need to look for //style[@type='text/css'] and //link[@rel='stylesheet'][@type='text/css']/@href and then use a proper CSS parser to extract all of the content. (Or, if you want to be lazy, you could probably get away with the regex /url\((.*?)\)/.)
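For the lazy route, here is a minimal sketch that applies the regex from this answer with preg_match_all and then trims whitespace and quotes from each capture. The sample stylesheet is invented for illustration. It is a shortcut, not a CSS parser: it will misread URLs that contain a closing parenthesis, and it will not see @import rules written without url().

```php
<?php
// Hypothetical stylesheet text, e.g. the contents of a <style> element
// or of a file referenced by <link rel="stylesheet">.
$css = <<<'CSS'
body { background-image: url("../images/bg.png"); }
.logo { background: url(  '/img/logo.gif' ) no-repeat; }
.icon { background: url(icons/star.png); }
CSS;

// Non-greedy capture of everything between "url(" and the next ")".
preg_match_all('/url\((.*?)\)/', $css, $matches);

// Strip surrounding whitespace and optional single or double quotes.
$cssLinks = array_map(
    function ($raw) {
        return trim($raw, " \t\n\r\"'");
    },
    $matches[1]
);
print_r($cssLinks);
```

The resulting strings can then be passed through parse_url in the same way as links taken from HTML attributes.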
