1

What would be the C# / regex syntax to remove the link from the first image in a body of text like:

text
<a href="..." class="..."><img src="..." class="..." width="..." /></a>
more text
<a href="..." class="..."><img src="..." class="..." width="..." /></a>
even more text

So that the final result would be:

text
<img src="..." class="..." width="..." />
more text
<a href="..." class="..."><img src="..." class="..." width="..." /></a>
even more text

Any advice would be greatly appreciated! Thanks in advance.

jessehouwing
Mark Hardin
  • 1
    Instead of removing the link I'd rather give you a new one: http://stackoverflow.com/a/1732454/93462 – Igor Korkhov Mar 07 '12 at 16:57
  • 1
    Obligatory- see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Chris Shain Mar 07 '12 at 16:57
  • http://stackoverflow.com/a/1732454/308851 – chx Mar 07 '12 at 16:58
  • Obligatory Note: [You can't parse HTML with regex](http://stackoverflow.com/a/1732454/21567). You might have a special case here, but this is not obvious from the question. – Christian.K Mar 07 '12 at 16:58
  • Oh my, four identical links within a minute. People really got sensitive to that issue ;-) – Christian.K Mar 07 '12 at 17:00
  • Yea, this is simply for a quick piece of throw away code. It didn't have to be elegant. Thanks for the posts! – Mark Hardin Mar 07 '12 at 18:20
  • Kudos for requesting a way to do it with C# and/or regex. As answered below, you can use the HTML Agility Pack with C# to solve this without writing a Regex. You can also solve this using Regex if your data is very consistent and/or you only need to use it once. – jessehouwing Mar 11 '12 at 16:02

3 Answers

2

Do yourself a favor and use something like HTML Agility Pack. As we mentioned in the comments, regex and HTML only lead to tears.

Chris Shain
1

Using the HTML Agility Pack (project page, nuget), this does the trick:

using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("text <a href=\"...\" class=\"...\"><img src=\"...\" class=\"...\" width=\"...\" /></a> more text"
     + " <a href=\"...\" class=\"...\"><img src=\"...\" class=\"...\" width=\"...\" /></a> even more text");

// Find the first <img> whose direct parent is an <a>.
var firstImage = doc.DocumentNode.Descendants("img").FirstOrDefault(node => node.ParentNode.Name == "a");

if (firstImage != null)
{
    var aNode = firstImage.ParentNode;
    // Detach the image from the link, then put it where the link was.
    aNode.RemoveChild(firstImage);
    aNode.ParentNode.ReplaceChild(firstImage, aNode);
}

var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);

I personally find this a lot easier on the eyes, as it clearly states what you are trying to accomplish.

  1. Find the first IMG inside an A tag
  2. Store the IMG temporarily
  3. Swap the IMG and the A tag
  4. Save the results.
jessehouwing
-1

Try this

 <a\s[^>]*href\s*=\s*\"([^\"]*)\"[^>]*>(.*?)</a>
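For throwaway code, the pattern above could be applied roughly as follows. This is only a minimal sketch: the `LinkStripper` class and `UnwrapFirstLink` method are hypothetical names chosen for illustration, and it assumes the markup is as regular as in the question (double-quoted `href`, no nested links). It keeps the second capture group, which is the link's content, and limits the replacement to the first match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LinkStripper
{
    // The pattern from this answer, written as a verbatim string
    // (doubled quotes instead of \" escapes).
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\s[^>]*href\s*=\s*""([^""]*)""[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string UnwrapFirstLink(string html)
    {
        // "$2" is the second capture group (the link's content);
        // the count of 1 restricts the replacement to the first match.
        return AnchorPattern.Replace(html, "$2", 1);
    }
}
```

Calling `LinkStripper.UnwrapFirstLink(body)` unwraps the first `<a>` element that has a double-quoted `href`, whether or not it contains an image, so it only gives the requested result when the first link in the text is the image link. For anything less regular, an HTML parser is the safer choice.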
Brian Mains
mohsen dorparasti
  • Probably due to the fact that it's just a regex with no explanation of how it works or why it would be a good solution, and without the warning that using regex to make changes to HTML is usually a no-no. Especially since there are very easy-to-use alternatives that you could have pointed the OP towards. – jessehouwing Mar 12 '12 at 18:05
  • Not a good reason. He asked for a regex or C# code and I suggested one. I'm not here to explain everything; I give pointers, and he should use them to solve his problem. – mohsen dorparasti Mar 12 '12 at 19:57
  • Edited your answer to make it constructive. This solution will help others who stumble upon the same post with a concrete example on how to use it and it provides enough links to external contents to help the reader find a similar solution in the future. – jessehouwing Mar 12 '12 at 20:12