
I am building a web crawler in PHP, meant for intranet use (we're dealing with a huge intranet). I managed to download a web page using the cURL functions, and now I want to scan its content for links. I am trying to find all the obvious links and split each one into its scheme/authority/path/query/fragment components so I can index them properly.

Is there a known regular expression that matches all the links, including ones like <img src="../images/header/logo.png" />, background-image: url(..), and <a href="?query#lonely-fragment">?

What are all the plain-text link representations that I can find using regular expressions in PHP?

cdhowie
f.ardelian

1 Answer


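You will be better off parsing the documents with a proper HTML parser. Regular expressions are poorly suited to this: they cannot reliably cope with nested markup, comments, or the different ways attribute values may be quoted.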

Once you have done that, it's fairly trivial using XPath to scan for e.g. //img/@src or //a/@href to find all of the content links in the document itself.
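As a rough illustration of the XPath approach, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). The helper name extractLinks and the sample markup are made up for this example, and it assumes the page is already available as a string (for instance from cURL). It also passes each result through parse_url(), which is one way to split a link into scheme, host, path, query and fragment.

```php
<?php
// Hypothetical helper: collect href/src attribute values from an HTML string.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of letting them be emitted.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];

    // The union selects both attribute types, returned in document order.
    foreach ($xpath->query('//a/@href | //img/@src') as $attr) {
        $links[] = $attr->nodeValue;
    }

    return $links;
}

// Sample markup (an assumption for illustration only).
$html = '<html><body>'
      . '<img src="../images/header/logo.png" />'
      . '<a href="?query#lonely-fragment">x</a>'
      . '</body></html>';

foreach (extractLinks($html) as $link) {
    // parse_url() splits a URL or relative reference into its components.
    print_r(parse_url($link));
}
```

Relative references such as ../images/header/logo.png still need to be resolved against the page's base URL before indexing; that step is not shown here.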

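If you want to scan CSS, you will also need to look for //style[@type='text/css'] and //link[@rel='stylesheet'][@type='text/css']/@href and then use a proper CSS parser to extract all of the content. Note that the type attribute is optional in HTML5, so you may want to drop the [@type='text/css'] predicates to avoid missing stylesheets. (Or, if you want to be lazy, you could probably get away with the regex /url\((.*?)\)/, keeping in mind that the captured value may still be wrapped in quotes.)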
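For the lazy route, something along these lines might do. This is only a sketch: the function name extractCssUrls and the sample CSS are invented for the example, and the pattern is the regex mentioned above with a case-insensitive flag added.

```php
<?php
// Hypothetical helper: pull url(...) targets out of a CSS string.
function extractCssUrls(string $css): array
{
    // Lazy capture of everything between "url(" and the next ")".
    preg_match_all('/url\((.*?)\)/i', $css, $matches);

    $urls = [];
    foreach ($matches[1] as $raw) {
        // Strip surrounding whitespace and optional single/double quotes.
        $urls[] = trim($raw, " \t\n\r'\"");
    }

    return $urls;
}

// Sample stylesheet text (an assumption for illustration only).
$css = 'body { background-image: url("../images/bg.png"); } '
     . 'h1 { background: url( logo.gif ) no-repeat; }';

print_r(extractCssUrls($css));
```

This will trip over URLs that contain a closing parenthesis, url() text inside CSS comments, and @import rules written with a bare string instead of url(), which is where a real CSS parser earns its keep.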

cdhowie