I'm trying to match URLs in text without matching the ones that are already inside anchor tags, so I used this:

(?!([^\<]+)<\/a>)(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])

It skips URLs that sit inside an anchor tag and selects only the URLs outside anchor tags, but it fails for this case:

<a href="www.google.com"> <b> Google Homepage </b> </a>
because of the left angle bracket of the nested tag.

So I thought of using this:

(?!(<a.+)<\/a>)(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])
but this isn't working either.
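For reference, a minimal Python sketch that reproduces both behaviours. The sample strings, the variable names, and the case-insensitive flag are assumptions. The sample hrefs carry an http:// scheme, because the URL part of the pattern requires a scheme and can never match a bare www.google.com.

```python
import re

# URL part shared by both patterns, copied from the question.
URL = r"(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])"

first = re.compile(r"(?!([^\<]+)<\/a>)" + URL, re.IGNORECASE)
second = re.compile(r"(?!(<a.+)<\/a>)" + URL, re.IGNORECASE)

# Hypothetical sample inputs.
plain = "see http://example.com for details"
simple = '<a href="http://example.com">Example</a>'
nested = '<a href="http://www.google.com"> <b> Google Homepage </b> </a>'

# First pattern: the lookahead is tested at the position where the URL starts.
# It only rejects the URL when everything from there up to </a> contains no "<".
print(first.search(plain))    # matches: no </a> follows the URL
print(first.search(simple))   # None: 'http://example.com">Example</a>' has no "<" before </a>
print(first.search(nested))   # matches: [^<]+ stops at "<b>", so the lookahead never reaches </a>

# Second pattern: the lookahead is still tested where the URL starts.
# The text there begins with "h", never with "<a", so the negative lookahead
# always succeeds and no URL is ever excluded.
print(second.search(simple))  # matches even though the URL is inside an anchor
```

In both patterns the lookahead only inspects text to the right of the URL's first character, so it cannot see an opening tag that comes before the URL.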

Can anybody please explain why this is not working, and what a possible solution to my problem would be?

I hope I have explained the question clearly. Thanks in advance for helping.

Avinash
  • "what can be the possible solution to my problem" Don't use REGEX for something it's not suited for. Get a parser library for HTML and use that. – nvoigt Jul 03 '18 at 15:06
  • Please refer to [this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) answer that sums up all your problems with this approach, current and future. – nvoigt Jul 03 '18 at 15:08

2 Answers


Never use regex to parse HTML. Just don't. There are too many complications, and using an HTML parser library is just way easier. This link should help you decide: https://tomassetti.me/parsing-html/ If you don't want to go to the link, here is the gist of the different parsers:

Java

  • Lagarto and Jerry
  • HtmlCleaner
  • Jsoup

C#

  • AngleSharp
  • HtmlAgilityPack

Python

  • HTML Parser of The Standard Library
  • Html5lib
  • Html5-parser
  • Lxml
  • AdvancedHTMLParser
  • Beautiful Soup

JavaScript

  • Browser

    • jQuery
    • DOMParser
  • Node.js

    • Cheerio
    • Jsdom
    • Htmlparser2
    • Parse5
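As a rough illustration of the parser route, here is a minimal sketch using the HTML parser from the Python standard library (html.parser). The class name Linkifier, the function linkify, and the simplified URL pattern are assumptions made for this example. Comments and declarations in the input are dropped, and script/style content is not treated specially, so take it as a starting point rather than a finished solution.

```python
import re
from html import escape
from html.parser import HTMLParser

# Simplified version of the URL pattern from the question (assumption).
URL_RE = re.compile(
    r"\b(?:https?|ftp)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]",
    re.IGNORECASE,
)


class Linkifier(HTMLParser):
    """Re-emits the input and wraps URLs found in text outside <a> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.anchor_depth = 0  # > 0 while we are inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())  # raw tag text, unchanged

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives with character references decoded, so re-escape it.
        if self.anchor_depth:
            self.out.append(escape(data, quote=False))
            return
        pos = 0
        for m in URL_RE.finditer(data):
            url = m.group(0)
            self.out.append(escape(data[pos:m.start()], quote=False))
            self.out.append(f'<a href="{escape(url)}">{escape(url, quote=False)}</a>')
            pos = m.end()
        self.out.append(escape(data[pos:], quote=False))


def linkify(fragment: str) -> str:
    parser = Linkifier()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


sample = 'Visit http://example.com or <a href="http://www.google.com"> <b> Google Homepage </b> </a>'
print(linkify(sample))
```

Because the parser tracks when it is inside an anchor, nested tags such as the b element do not confuse it, which is the case the lookahead-based patterns cannot handle.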
Sheshank S.
  • Nope, I'm not parsing the HTML. I have a text input, and whenever a user enters a URL I want to convert it to a clickable link. Now the users want to give names to those links using anchor tags; apart from this, everything else is simple text. – Sandeep Gupta Jul 04 '18 at 04:53

Try this:

(a\shref=".+"|\/?b|Google Homepage|\/?a)
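If a regex-only approach is still wanted, one more general sketch (not tied to a specific inner tag) is to match whole anchor elements and other tags first in an alternation, and only treat a match as a URL when the URL group took part. This assumes anchors are not nested and that the input is mostly plain text with the occasional anchor. The names PATTERN and linkify_outside_anchors and the simplified URL pattern are made up for this example, written here in Python.

```python
import re

# Simplified version of the URL pattern from the question (assumption).
URL = r"\b(?:https?|ftp)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]"

PATTERN = re.compile(
    r"<a\b[^>]*>.*?</a>"   # 1st branch: a whole anchor element, consumed and ignored
    r"|<[^>]*>"            # 2nd branch: any other tag, so URLs in attributes are skipped
    r"|(" + URL + r")",    # 3rd branch: a URL in plain text, captured in group 1
    re.IGNORECASE | re.DOTALL,
)


def linkify_outside_anchors(text: str) -> str:
    def repl(m):
        url = m.group(1)
        if url is None:
            return m.group(0)  # an anchor or other tag: leave it untouched
        return f'<a href="{url}">{url}</a>'

    return PATTERN.sub(repl, text)


sample = 'Visit http://example.com or <a href="http://www.google.com"> <b> Google Homepage </b> </a>'
print(linkify_outside_anchors(sample))
```

The point of the alternation is that the regex engine consumes the whole anchor before it ever gets a chance to start a match at a URL inside it, so no lookahead or lookbehind is needed. It will still break on malformed or nested markup, which is where a real parser is the safer choice.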
  • (a\shref=([a-z]|\.|\?|\"|)+|\/?b|([a-zA-Z]+)|\/?a) – cristian leonardi Jul 03 '18 at 13:27
  • Can there be a general solution? This is specific to the b tag, but any other tag is possible too. I need something that says: do not select anything inside an anchor tag (whatever is inside it). By the way, thanks for helping. – Sandeep Gupta Jul 03 '18 at 13:31
  • Oh, now I understand. Try this also if you want: ' > ([A-Za-z0-9 \. \? \! \ S] +) < ' – cristian leonardi Jul 03 '18 at 14:04
  • Please don't advise people to use Regex to parse html. That's a trap. It might work for their very small, specific problem here, but when that is solved they will just stumble into the next and then the next and then the next. Suggest a proper solution instead (or at least on top). – nvoigt Jul 03 '18 at 15:11
  • I made a different answer if that helps @nvoigt – Sheshank S. Jul 03 '18 at 16:39