I need to catch all links from multiple websites. For that I have gathered the entire HTML file. I need a regular expression that puts all of them in an array.
I don't want to collect any image files or other code files, just the HTML pages themselves.
I want it to collect all links like this:
/https://www.hello.com
/https://www.hello.com/index.php
/https://www.hello.com/world
/https://www.hello.com/world.php
/https://www.hello.com/world.html
/https://hello.com
/https://hello.com/world
/http://www.hello.com
/http://www.hello.com/world
/http://hello.com
/http://hello.com/world
/www.hello.com
/www.hello.com/world
/hello.com
/hello.com/world
/hello
/hello/world
But not like this:
hello
hello/world
hello.png
hello.zip
/hello/world.png
/hello/world.js
What regular expression would I need for this? Or is there a better way (maybe by collecting the <a> elements)?
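
To make the idea of collecting the anchor tags concrete, here is a minimal sketch in Python, assuming the standard library's html.parser is acceptable. The names LinkCollector, wanted, extract_links and BLOCKED_EXTENSIONS are hypothetical, and the filter is only my reading of the two lists above: keep an href that starts with "/" unless it ends in a blocked file extension. The extension list itself is an assumption and would need extending.

```python
from html.parser import HTMLParser
import posixpath

# Assumption: extensions that count as "image files or other code files".
BLOCKED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
                      ".zip", ".js", ".css"}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def wanted(href):
    """Keep links that start with "/" and do not end in a blocked extension."""
    if not href.startswith("/"):
        return False
    # Drop any query string or fragment before looking at the extension.
    path = href.split("?", 1)[0].split("#", 1)[0]
    ext = posixpath.splitext(path)[1].lower()
    return ext not in BLOCKED_EXTENSIONS


def extract_links(html):
    """Return the wanted hrefs from an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [href for href in parser.hrefs if wanted(href)]
```

Because only the href of each a tag is read, img and script sources never enter the array in the first place; the extension check is only there for anchors that point directly at files.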