3

How can I strip hyperlink <a> tags from within some text? I want to remove the whole lot, including whatever text or image is being linked, up to and including the closing </a> tag.

E.g.

<a href="http://stackoverflow.com">Click here</a>        
<a href="http://stackoverflow.com"><img src="http://stackoverflow.com" alt = "blah"></a>

i.e. remove the whole lot.

Any ideas how to do this?

Thanks

thegunner

3 Answers

1

Obligatory "don't use regex to parse HTML" warning: see the question "RegEx match open tags except XHTML self-contained tags".

I would recommend either converting to XHTML and using XPath, or taking a look at HtmlAgilityPack. I have used both methods for parsing and modifying HTML in the past, and they are far more flexible and robust than regex.

Here is an example that should get you started with HtmlAgilityPack:

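 // Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm");
 // SelectNodes returns null when nothing matches, so check before looping.
 HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
 if (links != null)
 {
     foreach (HtmlNode link in links)
     {
         // Do stuff! For example, remove the link and everything inside it:
         link.Remove();
     }
 }
 doc.Save("file.htm");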
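
Since the question is about text rather than a file on disk, here is a minimal sketch of the same idea working on a string. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `HapLinkRemover` and the method name `RemoveLinks` are made up for illustration and are not part of the library.

```csharp
using HtmlAgilityPack; // third-party NuGet package, not part of the base class library

public static class HapLinkRemover
{
    // Hypothetical helper: returns the markup with every <a href="..."> element removed.
    public static string RemoveLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                link.Remove(); // removes the element together with all of its children
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `HapLinkRemover.RemoveLinks(text)` on the two sample lines from the question should leave neither the anchors nor the linked text or image behind. The XPath `//a[@href]` only selects anchors that carry an `href` attribute; `//a` would also catch named anchors, if those should go too.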
Abe Miessler
  • +1 for HTMLAgilityPack. Regex might suffice for this case but for more complex HTML parsing, HTMLAgilityPack is the way to go. – keyboardP Jun 24 '13 at 18:34
0

From what I understand, this should work:

string linksRemoved = Regex.Replace(withLinks, @"<a\b[^>]*>.*?</a>", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
keyboardP
0

You can try a regular expression to replace your tags. My regex isn't the best but this should get you close.

System.Text.RegularExpressions.Regex.Replace(
     input, 
     @"<a[^>]*?>.*?</a>", 
     string.Empty);
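
The pattern above can be tightened a little. The following is only a sketch, and the names `LinkStripper` and `StripLinks` are invented for the example. It adds `\b` so that tags such as `<abbr>` are not mistaken for anchors, and uses `RegexOptions.IgnoreCase` and `RegexOptions.Singleline` so that `<A ...>` and links spanning several lines are matched as well. It still assumes reasonably well-formed markup: an attribute value containing `>` or a nested anchor would defeat it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkStripper
{
    // \b keeps the pattern from matching other tags that start with "a", such as <abbr>.
    // Singleline lets "." match line breaks; IgnoreCase also covers <A ...> and </A>.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\b[^>]*>.*?</a\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: removes each anchor element along with whatever it wraps.
    public static string StripLinks(string input)
    {
        return AnchorPattern.Replace(input, string.Empty);
    }
}
```

For instance, `LinkStripper.StripLinks("x <a href=\"u\">Click here</a> y")` leaves `"x  y"`, with the two spaces that surrounded the link left untouched.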
Jay