0

I am trying to parse HTML to find the URLs in posts. It works most of the time, but one case fails. I need to extract every link in a post. The link format varies as follows:

google.com
google.com/q=love
google.com/in-love/1212/a
www.google.com/in-love/1212/a
www.google.com/q=love
www.google.com
http://www.google.com/in-love/1212/a
http://google.com
http://www.google.com
http://google.com/q=love
https://www.google.com/in-love/1212/a
https://google.com
https://www.google.com
https://google.com/q=love

But in some cases my regex also matches strings like these, which are not links:

tanmoy.kundu
i.e

I am using this regex to parse the HTML post:

/\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.|-)+[\.][^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/g

I need the parsing to check for a valid domain extension, such as .com, .uk, etc.

tripleee
  • Well, there can be anything after a `.`; there is no pattern for that. The extension can be anything these days. If you want to limit the extensions, you need to check them manually rather than with a pattern. – Harry Bomrah Dec 09 '15 at 09:56
  • I hate to be negative here, but this is never going to be a complete solution. First off, take a look at why you should not try to use regex alone to parse HTML [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Then take a look at just how complex a URL can be in [RFC 3986](https://tools.ietf.org/html/rfc3986). There are always going to be corner cases you miss with a regex. – N. Leavy Dec 09 '15 at 10:06
  • [This is a good related article](http://www.regular-expressions.info/email.html) - while it is talking about matching emails, it has a good discussion about domain names. – James Thorpe Dec 09 '15 at 10:39
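
To illustrate the point in the comments about not leaning on a single regex, here is a minimal sketch that hands each token to the WHATWG `URL` class and keeps only the hostname. It assumes a runtime that provides `URL` (Node.js or a browser); the helper name `hostnameOf` is hypothetical and not from the original post.

```javascript
// Hypothetical helper: returns the hostname of a token, or null if the
// token cannot be parsed as a URL at all.
function hostnameOf(token) {
  // Bare inputs such as "google.com/q=love" have no scheme, so prepend one
  // (an assumption made only so that the URL parser accepts them).
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(token);
  const withScheme = hasScheme ? token : "http://" + token;
  try {
    return new URL(withScheme).hostname;
  } catch (err) {
    return null; // the parser rejected the input
  }
}

console.log(hostnameOf("google.com/q=love"));
console.log(hostnameOf("https://www.google.com/in-love/1212/a"));
```

Parsing only normalizes the input: `tanmoy.kundu` still yields a hostname, so the extension has to be checked separately.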

2 Answers

1

This regex worked for my case:

/(((?:ht|f)tp[s]?:[\/]{2})?(?:\w+(?::\w+)?@)?(?:(?:(?:\d{1,3}\.){3}\d{1,3})|(?:(?:\w|\d|\.|\$|_|@|\+|\-)*(?:\w|\d|\$|_|@|\+|\-)\.(?:aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|za|zm|zw))(?!\w))(?::\d{1,5})?(?:\/+(?:\w|\d|\.|\=|\$|_|@|\+|\-|~)*(?:\w|\d|\$|_|@|\+|\-|~))*\/*(?:\?(?:\w|\d|\.|\$|_|@|\+|\-|&|=)*)?)/g

Thanks :-)
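
A hard-coded alternation like the one above can also be generated from an array, which makes the extension list easier to maintain. The following is only a sketch of that idea, not the answer's pattern: it validates a bare host name rather than a full URL, and `TLDS`, `buildHostRegex` and `HOST` are hypothetical names with a deliberately tiny list.

```javascript
// Illustrative subset only; the pattern above lists many more extensions.
const TLDS = ["com", "org", "net", "uk", "in"];

// Build an anchored host pattern whose final label must be one of `tlds`.
function buildHostRegex(tlds) {
  const alternation = tlds.join("|");
  // One or more "label." groups followed by an allowed extension.
  return new RegExp("^(?:[a-z0-9-]+\\.)+(?:" + alternation + ")$", "i");
}

const HOST = buildHostRegex(TLDS);

for (const host of ["www.google.com", "google.co.uk", "tanmoy.kundu", "i.e"]) {
  console.log(host, HOST.test(host));
}
```

With this subset, `www.google.com` and `google.co.uk` pass, while `tanmoy.kundu` and `i.e` are rejected.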

0

A regex exists to cover the largest possible number of cases with a single rule.

For URL validation, though, it is very difficult to check every URL with one regex: the new gTLDs are longer than the old extensions (the full list of gTLDs and older extensions is at https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains), many websites use subdomains, and so on.

For me, the best regex pattern should test the extension, to know whether the URL can really exist or not. This site, https://mathiasbynens.be/demo/url-regex, collects many regex patterns for checking URLs.

In your case,

i.e
tanmoy.kundu

If the regex checked whether the extension is valid, it would work here: 'e' and 'kundu' are not valid extensions :p
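
One possible way to do that check, sketched under assumptions: match candidates with a loose pattern, then look the extension up in a set of known TLDs. The pattern `CANDIDATE` is much simpler than the one in the question, `KNOWN_TLDS` is a tiny illustrative subset rather than the real list, and `findUrls` is a hypothetical name.

```javascript
// Illustrative subset; a real list would be loaded from a maintained TLD list.
const KNOWN_TLDS = new Set(["com", "org", "net", "uk", "in", "be"]);

// Loose candidate pattern (an assumption, not the question's regex):
// optional scheme, dotted host, captured extension, optional path/query/fragment.
const CANDIDATE =
  /(?:(?:https?|ftp):\/\/)?(?:[a-z0-9-]+\.)+([a-z]{2,})(?!\w)(?:[\/?#][^\s]*)?/gi;

// Return every candidate whose extension is in KNOWN_TLDS.
function findUrls(text) {
  const found = [];
  for (const match of text.matchAll(CANDIDATE)) {
    const tld = match[1].toLowerCase();
    if (KNOWN_TLDS.has(tld)) {
      found.push(match[0]);
    }
  }
  return found;
}

console.log(findUrls("see google.com/q=love and tanmoy.kundu or i.e today"));
```

Keeping the list outside the pattern means new gTLDs can be added without touching the regex.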

And don't forget that you can test your regex at http://www.regexpal.com ^_^ It's easy.

Doc Roms