Regular expression for remove html links

Question

Possible Duplicate:
Regular expression for parsing links from a webpage?
RegEx match open tags except XHTML self-contained tags

I need a regular expression to strip HTML <a> tags. Here is a sample:

<a href="xxxx" class="yyy" title="zzz" ...> link </a>

should be converted to

 link

Do you 'need' a regular expression? – Matt Fenwick Sep 23 '11 at 16:46
@josh3736 I will feast on your Unicorn's blood. – Mateen Ulhaq Sep 26 '11 at 23:18
In what language? HTML doesn't have regular expressions. – Bill the Lizard Sep 29 '11 at 01:53

Bill Criswell · Answer 1 · 2011-09-26T15:07:59.093

13

I think you're looking for: </?a(|\s+[^>]+)>

edited Sep 26 '11 at 15:07

answered Sep 23 '11 at 16:40

Bill Criswell


When have you ever seen just an `<a>` tag? – Bill Criswell Sep 26 '11 at 13:21
I edited it to account for strange cases like that anyway. – Bill Criswell Sep 26 '11 at 15:08
Doesn't match `< a>` or `< /a>`. – Mateen Ulhaq Sep 28 '11 at 23:13

score 3 · Answer 2 · answered Sep 24 '11 at 21:23

Answers given above would also match valid HTML tags such as <abbr>, <address>, or <applet> and erroneously strip them out. A better regex, one that matches only anchor tags, would be

</?a(?:(?= )[^>]*)?>
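
To make the difference concrete, here is a minimal sketch in Python (my choice of language; the question never names one) that runs this pattern and the looser `</?a.*?>` from another answer over a made-up sample string. The variable names and the sample are assumptions for illustration only.

```python
import re

# Hypothetical sample input, invented for this comparison.
sample = '<abbr title="x">HTML</abbr> and <a href="xxxx" class="yyy">link</a>'

# Looser pattern from another answer: any tag whose name starts with "a".
loose = re.compile(r'</?a.*?>')
# Pattern from this answer: "a" must be followed by a space or by ">".
strict = re.compile(r'</?a(?:(?= )[^>]*)?>')

loose_result = loose.sub('', sample)
strict_result = strict.sub('', sample)

print(loose_result)   # the <abbr> element loses its tags too
print(strict_result)  # only the <a> tags are removed
```

One caveat worth knowing: the lookahead accepts only a literal space, so an anchor tag whose attributes begin after a tab or a newline would not be matched by this pattern.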

answered Sep 24 '11 at 21:23

rbrignoni


I've used this one with the free edition of Sublime Text 3. Worked best in my case. – GaryP Mar 01 '14 at 13:26

score 2 · Answer 3 · edited May 23 '17 at 11:53

You're going to have to use this hackish solution iteratively, and it probably won't even work perfectly for complicated HTML:

<a(\s[^>]*)?>.*?(</a>)?

Alternatively, you can try one of the existing HTML sanitizers/parsers out there.

HTML is not a regular language; any regex we give you will not be 'correct'. It's impossible. Even Jon Skeet and Chuck Norris can't do it. Before I lapse into a fit of rage, like @bobince [in]famously once did, I'll just say this:

Use a HTML Parser.

(Whatever they're called.)
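
For what it's worth, here is a rough sketch of the parser route using Python's standard-library `html.parser` (the language is my assumption; the question doesn't say). The names `AnchorStripper` and `strip_anchors` are made up for this example. It rebuilds the input from parser events and skips only the `<a>` start and end tags.

```python
from html.parser import HTMLParser


class AnchorStripper(HTMLParser):
    """Re-emits the parsed document, leaving out <a> start and end tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; unconverted,
        # so they can be written back exactly as they appeared.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            # get_starttag_text() returns the tag as written in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; emit them once, unchanged.
        if tag != 'a':
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != 'a':
            self.parts.append(f'</{tag}>')

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f'&{name};')

    def handle_charref(self, name):
        self.parts.append(f'&#{name};')

    def handle_comment(self, data):
        self.parts.append(f'<!--{data}-->')

    def handle_decl(self, decl):
        self.parts.append(f'<!{decl}>')


def strip_anchors(html):
    """Return html with anchor tags removed and their text kept."""
    parser = AnchorStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


print(strip_anchors('<p>See <a href="xxxx" class="yyy"> link </a></p>'))
```

This is a sketch, not a sanitizer: end tags are written back in lowercase, and processing instructions are dropped because no handler re-emits them, so the output is not guaranteed to be byte-identical to the input outside the removed anchors.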

EDIT:

If you want to 'incorrectly' strip out </a>s that don't have any <a>s as well, do this:

</?[a\s]*[^>]*>

edited May 23 '17 at 11:53

Community


answered Sep 25 '11 at 03:00

Mateen Ulhaq


Your regex: `<a(\s[^>]*)?>(</a>)?` does not match `</a>` closing tags (except for the case where the A element is empty). – ridgerunner Sep 26 '11 at 15:46
@ridgerunner Since regexes don't have memory, putting a `.*?` in between the two is the best I can do. It'll break down for more complicated HTML. – Mateen Ulhaq Sep 26 '11 at 23:15
Just curious: Why are you worried about the tag's text at all? – Bill Criswell Sep 28 '11 at 14:52
@BillCriswell Oh, damn, I just realized the OP probably doesn't need a 'regex' which will *not* strip out unmatched `</a>`s. (That would be incorrect, but I don't think the OP would care. :)) – Mateen Ulhaq Sep 28 '11 at 23:11

score 2 · Answer 4 · answered Sep 26 '11 at 15:36

Here's what I would use:

</?a\b[^>]*>
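
As a usage sketch, assuming Python since the question doesn't name a language, the pattern can be applied with `re.sub` to the sample from the question. The helper name `strip_anchor_tags` is made up, and the `re.IGNORECASE` flag is my own addition so that `<A HREF=...>` is caught as well.

```python
import re

# The pattern from this answer; IGNORECASE is an addition, not part of it.
ANCHOR_TAG = re.compile(r'</?a\b[^>]*>', re.IGNORECASE)


def strip_anchor_tags(html):
    """Remove opening and closing anchor tags, keeping the link text."""
    return ANCHOR_TAG.sub('', html)


sample = '<a href="xxxx" class="yyy" title="zzz"> link </a>'
result = strip_anchor_tags(sample)
print(repr(result))

# The word boundary keeps longer tag names such as <abbr> intact.
untouched = strip_anchor_tags('<abbr>HTML</abbr>')
print(untouched)
```

Like any pattern built on `[^>]*`, it stops at the first `>`, so an attribute value that itself contains `>` would leave debris behind.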

answered Sep 26 '11 at 15:36

ridgerunner


score 1 · Answer 5 · answered Sep 23 '11 at 16:44

</?a.*?> would work. Replace it with ''
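
The `?` after `.*` is doing real work in that pattern. A small Python sketch (language assumed, sample string made up) shows what happens with and without it:

```python
import re

sample = '<a href="xxxx"> link </a>'

# Greedy: .* runs to the last ">" in the string, swallowing the link text.
greedy_result = re.sub(r'</?a.*>', '', sample)

# Lazy: .*? stops at the first ">", so each tag is matched on its own.
lazy_result = re.sub(r'</?a.*?>', '', sample)

print(repr(greedy_result))
print(repr(lazy_result))
```

With the greedy form the whole sample is one match and nothing is left; with the lazy form only the two tags are removed and the text between them survives.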

answered Sep 23 '11 at 16:44

arviman


I just made a little change that works for me. Thanks for the help. // , edit your answer. – ShirazITCo Sep 23 '11 at 16:51
Yes of course, I merely gave the RE. You would have to append the `/` prefix/suffix if you were using javascript for instance. You would not have to add anything if you were using the C# regex library. – arviman Sep 23 '11 at 16:54
But there is a little problem: the </a> is not stripped. – ShirazITCo Sep 23 '11 at 16:55
Are you using POSIX or PCRE, i.e. `ereg_replace` or `preg_replace`? – arviman Sep 23 '11 at 17:06
You're stripping away the whole of `<a>blahblah</a>`. (JavaScript regexes are greedy, right?) – Mateen Ulhaq Sep 25 '11 at 03:03
@muntoo - No, it will not. It will find a match in `<a>` and stop, and then find another match at `</a>`. The `.*?` makes the search for `.*` non-greedy. – arviman Sep 25 '11 at 19:36
FYI: This also matches elements like , , , and . – Bill Criswell Sep 26 '11 at 15:12
@muntoo - Yes, Javascript regex quantifiers are greedy by default, but can be made non-greedy (or _lazy_) by appending a `?` after the quantifier. i.e. `.*` is greedy but `.*?` is lazy. – ridgerunner Sep 26 '11 at 15:42
@Bill Criswell - Agreed, your solution would be better. – arviman Sep 26 '11 at 15:47
