
I have an HTML string and want to replace every link with just its text.

For example, given

Some text <a href="http://google.com/">Google</a>.

I need to get

Some text Google.

What regex should I use?

lexu
sashaeve
  • Generally speaking *(and probably true in this case)*, you should not use regex to "parse" HTML and work on it; instead, you should use some tool to manipulate your HTML document via the DOM. – Pascal MARTIN Mar 13 '10 at 12:18
  • "How do I parse HTML with a regex" is probably in the top 10 of asked questions on SO. The answer is: You don't – erikkallen Mar 13 '10 at 12:29
  • It contains the top voted answer, that's for sure! - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Russ Cam Mar 13 '10 at 12:31
  • The task does look simple at first sight, but there are plenty of potential issues that can come out and bite you. Handling the correct, simple case is quite easy, but experience tells me there will be plenty of incorrect HTML merrily thrown at your code when you're on holiday or on your next project, and you are *usually* expected to have written code to handle many oddities. Regexes (well, most likely not a single one but a lot of different ones, together with some procedural code) can do this, but handling the bum cases is hard and loads of people have worked hard on this already elsewhere. – martinr Mar 13 '10 at 12:35
  • Sometimes there is a need for just the basics, where the input format or HTML formatting quality is known. I needed this to strip off some unwanted content before creating a PDF and it worked fine. – Andreas Nov 25 '13 at 22:14

3 Answers


Several similar questions have been posted, and the best practice is to use the Html Agility Pack, which is built specifically for tasks like this.

http://www.codeplex.com/htmlagilitypack
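
As a rough illustration of the DOM approach, here is a minimal sketch that assumes the HtmlAgilityPack package is referenced in the project. The class name `AgilityLinkStripper` and the method `StripLinks` are made up for this example and are not part of the library. It replaces each `<a>` element with a text node holding the element's inner text, so any markup nested inside a link is dropped as well.

```csharp
using System.Linq;
using HtmlAgilityPack; // third-party package, assumed to be installed

// Hypothetical helper, named for illustration only.
public static class AgilityLinkStripper
{
    public static string StripLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var anchors = doc.DocumentNode.SelectNodes("//a");
        if (anchors == null)
        {
            return html;
        }

        // Iterate over a copy so the document can be modified safely.
        foreach (var anchor in anchors.ToList())
        {
            // Swap the whole <a> element for a plain text node with its text.
            var text = doc.CreateTextNode(anchor.InnerText);
            anchor.ParentNode.ReplaceChild(text, anchor);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the parser builds a tree first, attribute order, letter case and unusual spacing inside the tag should not matter here, which is the main argument for this route over a regex.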

Fadrian Sudaman

I asked for a simple regex (thanks, Fadrian). The code is as follows:

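// Requires: using System.Text.RegularExpressions;
var html = @"Some text <a href=""http://google.com/"">Google</a>.";
// Remove the opening tag (only matches when href is the first attribute).
Regex r = new Regex(@"\<a href=.*?\>");
html = r.Replace(html, "");
// Remove the closing tag.
r = new Regex(@"\</a\>");
html = r.Replace(html, "");
// html is now "Some text Google."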
sashaeve
  • Welcome. So I take it that this is what you wanted? If so, please accept the answer so that others don't waste time posting more answers. – Fadrian Sudaman Mar 13 '10 at 12:56
  • This doesn't handle the case where the tag has a different attribute (e.g. title) before href. See my answer below. – Andrew Theken Mar 13 '10 at 13:56
var html = "<a ....>some text</a>";
var ripper = new Regex("<a.*?>(?<anchortext>.*?)</a>", RegexOptions.IgnoreCase);
html = ripper.Match(html).Groups["anchortext"].Value;
// html == "some text" (Match keeps only the first link's text; anything outside the tag is dropped)
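
To keep the surrounding text and handle every link in the string, the same named group can be fed to `Regex.Replace` instead of `Match`. The following is only a sketch under a few assumptions: the input is reasonably well-formed, links are not nested, and the class name `LinkStripper` is invented for this example. The pattern is also tightened to `<a\b[^>]*>` so that tags such as `<abbr>` are not mistaken for anchors.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class LinkStripper
{
    // <a\b[^>]*>          opening anchor tag with any attributes, in any order
    // (?<anchortext>.*?)  the link text, captured lazily
    // </a\s*>             the closing tag
    private static readonly Regex AnchorRegex = new Regex(
        @"<a\b[^>]*>(?<anchortext>.*?)</a\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string StripLinks(string html)
    {
        // Replace each whole <a ...>text</a> match with just the captured text.
        return AnchorRegex.Replace(html, "${anchortext}");
    }
}
```

For the sample in the question, `LinkStripper.StripLinks(html)` returns `Some text Google.`, and it gives the same result when a `title` attribute comes before `href`. Malformed or nested markup can still defeat it, which is the usual caveat with regex-based HTML handling.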
Andrew Theken