Difference between `([^\<]+)<\/a>` and `(`?

Question

I'm trying not to capture anchor tags, so I used this:

(?!([^\<]+)<\/a>)(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])

This excludes anchor tags and selects only the URLs that are not inside them, but it fails for this case:

<a href="www.google.com"> <b> Google Homepage </b> </a>
because of the left angle bracket (`<`) of the nested tag.

So I thought of using this:

(?!(<a.+)<\/a>)(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])
but this doesn't work either.

Can anybody please explain why this is not working and what a possible solution to my problem might be?

I hope I explained the question clearly. Thanks in advance for helping.
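
A short Python sketch may help show what the two lookaheads actually do. It reuses both patterns from above unchanged; the sample strings and the `urls` helper are made up for illustration, and `re.IGNORECASE` is an assumption (the character classes only list `A-Z`). A lookahead is tested at the position where the URL match starts and only looks forward from there, so `(?!(<a.+)<\/a>)` can never block anything: the text at that position starts with `http`, not with `<a`.

```python
import re

# URL part of the pattern from the question, unchanged.
URL = r"(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])"

# First attempt: reject the URL if only non-"<" text separates it from "</a>".
first = re.compile(r"(?!([^\<]+)<\/a>)" + URL, re.IGNORECASE)  # flag is an assumption
# Second attempt: reject the URL if "<a ... </a>" starts at the URL position.
second = re.compile(r"(?!(<a.+)<\/a>)" + URL, re.IGNORECASE)

# Hypothetical sample inputs.
plain = "see http://example.com/page for details"
simple_anchor = '<a href="http://example.com/a">http://example.com/a</a>'
nested_anchor = '<a href="http://example.com/b"> <b> Example </b> </a>'


def urls(pattern, text):
    """Return every full match of `pattern` in `text`."""
    return [m.group(0) for m in pattern.finditer(text)]


# Plain text: the URL is found, as intended.
print(urls(first, plain))
# Simple anchor: "[^<]+" reaches "</a>" from both URLs, so both are skipped.
print(urls(first, simple_anchor))
# Nested tag: "[^<]+" stops at "<b>", "</a>" does not follow, the href URL leaks.
print(urls(first, nested_anchor))
# Second pattern: the lookahead never sees "<a" at the URL position,
# so nothing is excluded at all.
print(urls(second, simple_anchor))
```

So the first pattern only skips a URL when nothing but non-`<` text lies between it and `</a>`, and the second one skips nothing.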

"what can be the possible solution to my problem" Don't use REGEX for something it's not suited for. Get a parser library for HTML and use that. — nvoigt, Jul 03 '18 at 15:06
Please refer to [this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) answer that sums up all your problems with this approach, current and future. — nvoigt, Jul 03 '18 at 15:08

score 1 · Answer 1 · answered Jul 03 '18 at 16:38

Never use regex to parse HTML. Just don't. There are too many complications, and using something like htmlparser is just way easier. This link should help you decide: https://tomassetti.me/parsing-html/ If you don't want to go to the link, here is the gist of the different parsers:

Java

Lagarto and Jerry
HtmlCleaner
Jsoup

C#

AngleSharp
HtmlAgilityPack

Python

HTML Parser of The Standard Library
Html5lib
Html5-parser
Lxml
AdvancedHTMLParser
Beautiful Soup

JavaScript

Browser
- jQuery
- DOMParser
Node.js
- Cheerio
- Jsdom
- Htmlparser2
- Parse5
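
As a rough illustration of the parser approach, here is a minimal sketch using Python's standard-library `html.parser` (the first entry in the Python list above). The `Linkifier` class, the `linkify` helper and the simplified `URL_RE` pattern are all made up for this example and are not taken from the question. The idea is to let the parser track whether we are inside an `<a>` element and to wrap URLs only in text that is outside one. It assumes the input is plain text with occasional simple tags; comments, doctypes and `<script>`/`<style>` content are not handled.

```python
import re
from html import escape
from html.parser import HTMLParser

# Simplified URL pattern for illustration only (not the one from the question).
URL_RE = re.compile(r"\b(?:https?|ftp)://[^\s<>\"']*[^\s<>\"'.,;:!?]")


class Linkifier(HTMLParser):
    """Re-emit the input, wrapping bare URLs in <a> unless already inside one."""

    def __init__(self):
        super().__init__()  # character references in text are decoded by default
        self.out = []
        self.anchor_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, unchanged

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through as-is.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.anchor_depth:
            # Text inside an anchor is left alone (only re-escaped).
            self.out.append(escape(data, quote=False))
            return
        pos = 0
        for m in URL_RE.finditer(data):
            url = m.group(0)
            self.out.append(escape(data[pos:m.start()], quote=False))
            self.out.append(f'<a href="{escape(url)}">{escape(url, quote=False)}</a>')
            pos = m.end()
        self.out.append(escape(data[pos:], quote=False))


def linkify(text):
    """Return `text` with URLs outside anchor tags turned into links."""
    parser = Linkifier()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


sample = ('Visit http://example.com/page, or '
          '<a href="http://example.com/x"> <b> Example </b> </a>.')
print(linkify(sample))
```

Because the parser keeps the nesting state, tags inside the anchor (like `<b>` here) do not matter, which is exactly the case that is awkward to express with a lookahead.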

Nope, I'm not parsing the HTML. It's just that I have a text input, and whenever a user enters a URL I want to convert it to a clickable link. Now the user wants to give a name to those links using an anchor tag; apart from this, everything else is simple text. – Sandeep Gupta, Jul 04 '18 at 04:53

score 0 · Answer 2 · answered Jul 03 '18 at 13:27

Try using this:

(a\shref=".+"|\/?b|Google Homepage|\/?a)

cristian leonardi

(a\shref=([a-z]|\.|\?|\"|)+|\/?b|([a-zA-Z]+)|\/?a) – cristian leonardi Jul 03 '18 at 13:27
can there be a general solution, this is specifically for tag, there is a possibility of any other tag also. something by which I can say not select inside anchor tag (whatever there is, inside the tag). btw thanks for helping. – Sandeep Gupta Jul 03 '18 at 13:31
Oh, now I understand. Try this also if you want: ' > ([A-Za-z0-9 \. \? \! \ S] +) < ' – cristian leonardi Jul 03 '18 at 14:04
Please don't advise people to use Regex to parse html. That's a trap. It might work for their very small, specific problem here, but when that is solved they will just stumble into the next and then the next and then the next. Suggest a proper solution instead (or at least on top). – nvoigt Jul 03 '18 at 15:11
I made a different answer if that helps @nvoigt – Sheshank S. Jul 03 '18 at 16:39
