I am trying to parse HTML to find the URLs in posts. Most of the time it works, but in one case it does not parse correctly. I need to find all the links present in a post. The link format varies as follows:
google.com
google.com/q=love
google.com/in-love/1212/a
www.google.com/in-love/1212/a
www.google.com/q=love
www.google.com
http://www.google.com/in-love/1212/a
http://google.com
http://www.google.com
http://google.com/q=love
https://www.google.com/in-love/1212/a
https://google.com
https://www.google.com
https://google.com/q=love
But in some cases my regex also matches strings like these, which are not links:
tanmoy.kundu
i.e
I am using this regex to parse the HTML post:
/\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/g
I need the parsing to check for a valid domain suffix, like .com, .uk, etc., so that only real links are matched.
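
To make the requirement concrete, here is a minimal sketch of the kind of domain check meant here. It is not the regex above: `CANDIDATE` is a much simpler, hypothetical pattern, `KNOWN_TLDS` is a tiny hand-picked allow-list used only for illustration (a real list would be far longer), and `findLinks` is a made-up helper name. The idea is to match loosely first and then keep only candidates whose last host label is a known suffix.

```javascript
// Hypothetical allow-list for illustration only; not a complete list of suffixes.
const KNOWN_TLDS = new Set(["com", "org", "net", "uk", "in", "io"]);

// Loose candidate pattern (an assumption, far simpler than the regex in the post):
// optional scheme, optional "www.", dot-separated host labels (captured), optional path.
const CANDIDATE =
  /(?:(?:https?|ftp):\/\/)?(?:www\.)?([a-z0-9-]+(?:\.[a-z0-9-]+)+)(?:\/[^\s]*)?/gi;

// Return every candidate whose last host label is in the allow-list.
function findLinks(text) {
  const links = [];
  for (const match of text.matchAll(CANDIDATE)) {
    const host = match[1].toLowerCase();
    const tld = host.slice(host.lastIndexOf(".") + 1);
    if (KNOWN_TLDS.has(tld)) {
      links.push(match[0]);
    }
  }
  return links;
}

console.log(
  findLinks("see google.com/q=love and tanmoy.kundu, i.e http://www.google.com/in-love/1212/a")
);
// → [ 'google.com/q=love', 'http://www.google.com/in-love/1212/a' ]
```

With this approach, tanmoy.kundu and i.e are still picked up as candidates but are dropped because "kundu" and "e" are not in the allow-list. The sketch ignores ports, user info, IP addresses and localhost, and a path directly followed by punctuation would keep that punctuation.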