0

Possible Duplicate:
Regular expression for parsing links from a webpage?
RegEx match open tags except XHTML self-contained tags

I need a regular expression to strip HTML <a> tags. Here is a sample:

<a href="xxxx" class="yyy" title="zzz" ...> link </a>

should be converted to

 link
ShirazITCo

5 Answers

13

I think you're looking for: </?a(|\s+[^>]+)>
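As a rough illustration of how this pattern could be applied, here is a minimal Python sketch. The language choice and the helper name `strip_anchor_tags` are assumptions for the example; the question does not say which regex engine is in use.

```python
import re

# Pattern from this answer: an opening or closing <a> tag,
# either bare or followed by whitespace and attributes.
ANCHOR_TAG = re.compile(r"</?a(|\s+[^>]+)>")

def strip_anchor_tags(html: str) -> str:
    """Remove <a ...> and </a> tags, keeping the text between them."""
    return ANCHOR_TAG.sub("", html)

sample = '<a href="xxxx" class="yyy" title="zzz"> link </a>'
print(repr(strip_anchor_tags(sample)))  # ' link '
```

Only the tags are removed, so the whitespace around the link text is kept as-is.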

Bill Criswell
3

Some of the other answers (for example, </?a.*?>) would also match valid HTML tags such as <abbr>, <address>, or <applet> and strip them out erroneously. A better regex, which matches only anchor tags, would be:

</?a(?:(?= )[^>]*)?>
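To make the difference concrete, here is a small Python sketch comparing a broad pattern with the one above. Python and the names `BROAD` and `NARROW` are assumptions made for the example only.

```python
import re

BROAD = re.compile(r"</?a.*?>")                # matches any tag whose name starts with "a"
NARROW = re.compile(r"</?a(?:(?= )[^>]*)?>")   # the pattern from this answer

sample = '<abbr title="x">HTML</abbr> <a href="#">link</a>'

print(BROAD.sub("", sample))   # HTML link
print(NARROW.sub("", sample))  # <abbr title="x">HTML</abbr> link
```

As written, the lookahead checks for a literal space, so an opening tag whose name is followed by a tab or a newline would be left alone.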
rbrignoni
2

You're going to have to use this hackish solution iteratively, and it probably won't even work perfectly for complicated HTML:

<a(\s[^>]*)?>.*?(</a>)?

Alternatively, you can try one of the existing HTML sanitizers/parsers out there.


HTML is not a regular language; any regex we give you will not be 'correct'. It's impossible. Even Jon Skeet and Chuck Norris can't do it. Before I lapse into a fit of rage, like @bobince [in]famously once did, I'll just say this:

Use a HTML Parser.

(Whatever they're called.)
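For what it's worth, here is a minimal sketch of the parser route, assuming Python and its standard-library `html.parser` module. The class name `AnchorStripper` and the helper `strip_anchor_tags` are made up for this example. It re-emits everything it sees except the <a> start and end tags.

```python
from html.parser import HTMLParser

class AnchorStripper(HTMLParser):
    """Re-emit the document, dropping only <a ...> and </a> tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are re-emitted verbatim.
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

def strip_anchor_tags(html: str) -> str:
    parser = AnchorStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

print(strip_anchor_tags('<p>See <a href="x">this <abbr>link</abbr></a> &amp; more</p>'))
```

The parser lowercases tag names, so `<A HREF=...>` is handled as well. End tags are rebuilt from the tag name, so their original casing is not preserved.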


EDIT:

If you want to 'incorrectly' strip out </a>s that don't have any <a>s as well, do this:

</?[a\s]*[^>]*>
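One caveat worth checking before using this: because `[a\s]*` may match nothing and `[^>]*` then consumes any tag name, the pattern appears to match every tag, not only anchors. A quick Python sketch (the language and the name `LOOSE` are assumptions for the example):

```python
import re

LOOSE = re.compile(r"</?[a\s]*[^>]*>")  # the pattern from the EDIT above

sample = '<p>Read <a href="x">this</a> first.</p>'

# The <p> and </p> tags are removed along with the anchor tags.
print(LOOSE.sub("", sample))  # Read this first.
```

If only anchor tags should go, a pattern anchored on the tag name, such as </?a\b[^>]*> from another answer in this thread, is the safer choice.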
Mateen Ulhaq
  • Your regex: `<a(\s[^>]*)?>(</a>)?` does not match `</a>` closing tags (except for the case where the A element is empty). – ridgerunner Sep 26 '11 at 15:46
  • @ridgerunner Since regexes don't have memory, putting a `.*?` in between the two is the best I can do. It'll break down for more complicated HTML. – Mateen Ulhaq Sep 26 '11 at 23:15
  • Just curious: Why are you worried about the tag's text at all? – Bill Criswell Sep 28 '11 at 14:52
  • @BillCriswell Oh, damn, I just realized the OP probably doesn't need a 'regex' which will *not* strip out unmatched `</a>`s. (That would be incorrect, but I don't think the OP would care. :)) – Mateen Ulhaq Sep 28 '11 at 23:11
2

Here's what I would use:

</?a\b[^>]*>
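A short Python sketch of how the word boundary behaves on a few inputs. Python and the sample strings are assumptions for illustration, not part of the original question.

```python
import re

ANCHOR_TAG = re.compile(r"</?a\b[^>]*>")

samples = [
    '<a href="x">link</a>',      # ordinary anchor
    '<a\nhref="x">link</a>',     # newline after the tag name
    '<abbr>HTML</abbr>',         # different tag starting with "a"
    '<a-card>custom</a-card>',   # hyphenated custom element name
]

for s in samples:
    print(repr(ANCHOR_TAG.sub("", s)))
```

The first two come out as 'link' and <abbr> is left alone. Note that `\b` also sees a boundary between "a" and "-", so a hyphenated custom element name beginning with "a-" would be stripped too.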

ridgerunner
1

</?a.*?> would work. Replace it with ''

arviman