I am building a web crawler in PHP, meant for intranet use (we're dealing with a huge intranet). I managed to download a web page using the cURL functions, and now I want to scan its content for links. I am trying to find all the obvious links and split each one into its scheme/authority/path/query/fragment components so I can index them properly.
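To make the splitting step concrete, here is a minimal sketch that applies the generic URI-reference regular expression from RFC 3986, Appendix B. The function name `splitUriReference` is hypothetical, and the sketch only separates the five components; it does not resolve relative references against a base URL or validate anything.

```php
<?php
// Hypothetical helper: split a URI reference into its five components
// using the regular expression given in RFC 3986, Appendix B.
function splitUriReference(string $ref): array
{
    // "~" is used as the delimiter because the pattern contains "/" and "#".
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~s';
    preg_match($pattern, $ref, $m);

    // preg_match drops trailing groups that did not take part in the match,
    // so every component falls back to an empty string.
    return [
        'scheme'    => $m[2] ?? '',
        'authority' => $m[4] ?? '',
        'path'      => $m[5] ?? '',
        'query'     => $m[7] ?? '',
        'fragment'  => $m[9] ?? '',
    ];
}

// Example: a relative reference has no scheme or authority.
print_r(splitUriReference('?query#lonely-fragment'));
```

Note that this sketch cannot tell a missing component from an empty one (for example `page?` versus `page`); if that matters for indexing, the outer groups 6 and 8 would have to be inspected as well.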
Is there a known regular expression that matches all links, including ones like <img src="../images/header/logo.png" />, background-image: url(..), and <a href="?query#lonely-fragment">?
What are all the plain-text link representations that I can find using regular expressions in PHP?
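For illustration, a minimal sketch of the kind of extraction being asked about, covering only the three cases mentioned above: quoted `href` and `src` attributes, and CSS `url(...)` values. The function name `extractLinkCandidates` is hypothetical. The patterns are assumptions, not a known complete solution: they miss unquoted attribute values, other link-carrying attributes, and links built by scripts, and they will also pick up matches inside comments.

```php
<?php
// Hypothetical helper: collect raw link strings from HTML/CSS text.
// Returns attribute values first, then CSS url() values.
function extractLinkCandidates(string $html): array
{
    $links = [];

    // href="..." or src='...' : group 1 is the quote, group 2 the value.
    $attrPattern = '~\b(?:href|src)\s*=\s*(["\'])(.*?)\1~is';
    if (preg_match_all($attrPattern, $html, $m)) {
        $links = array_merge($links, $m[2]);
    }

    // url(...) with optional single or double quotes around the value.
    $cssPattern = '~url\(\s*(["\']?)(.*?)\1\s*\)~i';
    if (preg_match_all($cssPattern, $html, $m)) {
        $links = array_merge($links, $m[2]);
    }

    return $links;
}

$sample = '<img src="../images/header/logo.png" />'
        . '<div style="background-image: url(../images/bg.png)"></div>'
        . '<a href="?query#lonely-fragment">x</a>';

print_r(extractLinkCandidates($sample));
```

Each string returned here is still a raw, possibly relative reference, so it would need to be resolved against the page URL before being indexed.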