-2

I need to catch all links from multiple websites. For that I have downloaded the entire HTML of each page. I need a regular expression that puts all of the links in an array.

I don't want to collect any image files or other code files. Just the HTML pages themselves.


I want it to collect all links like this:

/https://www.hello.com
/https://www.hello.com/index.php
/https://www.hello.com/world
/https://www.hello.com/world.php
/https://www.hello.com/world.html
/https://hello.com
/https://hello.com/world
/http://www.hello.com
/http://www.hello.com/world
/http://hello.com
/http://hello.com/world
/www.hello.com
/www.hello.com/world
/hello.com
/hello.com/world
/hello
/hello/world

But not like this:

hello 
hello/world
hello.png
hello.zip
/hello/world.png
/hello/world.js

What regular expression would I need for this? Or is there a better way (maybe by collecting the <a> elements)?

bmols
  • Why the downvote? Seems like a legit question – Alicia Sykes May 22 '17 at 07:30
  • "is there a better way?": Well, a regex can't do this in a totally robust way (by the nature of the HTML language). But the alternative would be to use an HTML/XML parser, which might be total overkill for your simple task. So I'd go for regex. – leemes May 22 '17 at 07:45

2 Answers

0

I guess you define "link" as hyperlinks in the form of <a href="...">. The following regex (already in the form of a PHP string) should be a good start*:

'~<\\s*a\\s[^>]*href\\s*=\\s*"([^"]+)"~i'

Test this regex

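When you use this with preg_match($regex, $html, $match), $match[1] gives you the link. Note that preg_match stops at the first match; to collect every link into an array, use preg_match_all instead. The captured value is still in encoded form (it might contain HTML entities). To decode it, use html_entity_decode.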

$link = html_entity_decode($match[1]);

You should also exclude links that are just fragments of the same page, that is, links starting with the hash symbol: $link[0] == '#'
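
Putting the steps above together, here is a minimal sketch rather than a definitive implementation. The function name extractLinks and the sample $html are made up for illustration. The pattern follows the one above, written with `~` delimiters and the `i` flag, since the preg_* functions need a delimited pattern. It uses preg_match_all so that every match ends up in the array.

```php
<?php
// Hypothetical sample input; in practice $html holds the downloaded page.
$html = '<p><a href="/hello/world?a=1&amp;b=2">x</a> '
      . '<a class="nav" href="#top">top</a> '
      . '<a href="https://www.hello.com">home</a></p>';

// Pattern from the answer, wrapped in "~" delimiters (assumption: case-insensitive).
$regex = '~<\\s*a\\s[^>]*href\\s*=\\s*"([^"]+)"~i';

// Hypothetical helper: returns all double-quoted href values of <a> tags.
function extractLinks(string $html, string $regex): array
{
    $links = [];
    // $matches[1] holds the first capture group of every match.
    if (preg_match_all($regex, $html, $matches)) {
        foreach ($matches[1] as $raw) {
            // Turn entities such as &amp; back into plain characters.
            $link = html_entity_decode($raw);
            // Skip fragment-only links such as "#top".
            if ($link[0] === '#') {
                continue;
            }
            $links[] = $link;
        }
    }
    return $links;
}

$links = extractLinks($html, $regex);
print_r($links);
```

This does not yet filter out images or script files; that would be a separate check on each collected link.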


*This regex does not conform to the definition of the HTML language (I think that is impossible to do 100% correctly with a regex). For example, it fails for links where the attribute value is not wrapped in double quotes (it might be unquoted or wrapped in single quotes).

leemes
0

Something like PHPQuery may be preferable to using a regex in this case. See this answer for an explanation of why.
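
As a rough illustration of the parser-based route, here is a sketch that uses PHP's bundled DOMDocument class instead of PHPQuery (a swapped-in alternative, assuming the dom extension is enabled). The function name collectPageLinks and the extension list are assumptions made for this example, not part of any library. A parser also copes with single-quoted and unquoted attributes and decodes entities itself.

```php
<?php
// Hypothetical helper: returns href values of <a> elements, skipping
// fragment-only links and paths that end in a listed file extension.
function collectPageLinks(
    string $html,
    array $skipExtensions = ['png', 'jpg', 'gif', 'zip', 'js', 'css']
): array {
    if ($html === '') {
        return []; // loadHTML rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed; keep parser warnings internal.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Attribute values come back already entity-decoded.
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // no href, or a fragment of the same page
        }
        // Look at the path only, so query strings do not hide the extension.
        $path = (string) parse_url($href, PHP_URL_PATH);
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (in_array($ext, $skipExtensions, true)) {
            continue; // image, archive, script, ...
        }
        $links[] = $href;
    }
    return $links;
}

// Hypothetical sample input.
$sample = '<a href="/hello/world">a</a>'
        . "<a href='/hello/world.png'>b</a>"
        . '<a href="#top">c</a>'
        . '<a href="https://www.hello.com/index.php?x=1&amp;y=2">d</a>';
print_r(collectPageLinks($sample));
```

The sketch does not enforce the rule from the question that relative links without a leading slash (such as hello/world) be rejected; that would need one more check on $href.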

Community
MikeRalphson