I am using this regex to get all image urls in an html file:

(?<=img\s*\S*src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])

Is there any way to modify this regex so that it excludes img tags that are commented out with an HTML comment (<!-- ... -->)?

Andrey
  • Why not use a proper HTML parser instead? – Pekka Feb 24 '12 at 18:01
  • [The pony he comes...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Niet the Dark Absol Feb 24 '12 at 18:02
  • @Pekka: because I can't guarantee the html to be 100% "correct" - the app is getting it from non-IT personnel so there is a good chance of [badly] malformed html. – Andrey Feb 24 '12 at 18:06

2 Answers

If your regex already works for extracting images (which would be a miracle in itself), consider a regex to strip HTML comments, like so:

<!--.*?-->

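Replace every match with an empty string, and any images inside comments will no longer be picked up by your other regex. Make sure the dot also matches newlines (single-line/dotall mode in most regex engines), otherwise comments that span several lines are left in place.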
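As a rough sketch of the two-pass idea in C#, assuming .NET (the question's pattern relies on a variable-length lookbehind, which the .NET engine supports): the class and method names below (ImageUrlExtractor, Extract) are made up for illustration, and the question's regex is used unchanged, with whatever limitations it already has.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the question.
public static class ImageUrlExtractor
{
    // Matches one HTML comment. Singleline makes '.' match newlines,
    // so comments spanning several lines are removed as well.
    private static readonly Regex Comment =
        new Regex(@"<!--.*?-->", RegexOptions.Singleline);

    // The pattern from the question, unchanged.
    private static readonly Regex ImgSrc =
        new Regex(@"(?<=img\s*\S*src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])");

    public static List<string> Extract(string html)
    {
        // Pass 1: drop all comments, including any img tags inside them.
        string withoutComments = Comment.Replace(html, string.Empty);

        // Pass 2: run the original regex on what is left.
        var urls = new List<string>();
        foreach (Match m in ImgSrc.Matches(withoutComments))
        {
            urls.Add(m.Groups["Url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        string html = "<img src=\"a.png\"><!-- <img src=\"b.png\">\n --><img src='c.png'>";
        // prints: a.png, c.png
        Console.WriteLine(string.Join(", ", Extract(html)));
    }
}
```

Doing it in two passes keeps the original pattern untouched, which is usually easier to maintain than trying to teach a single regex about comment boundaries.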

Alternatively, if you're using PHP (you didn't tag a programming language), you can use the strip_tags function with "<img>" as the "allowable tags" parameter. This will strip out HTML comments, as well as other tags that may interfere with your regex.

Niet the Dark Absol
It's also very simple with the HTML Agility Pack, which has a number of settings that help fix bad HTML if needed. For example:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionAutoCloseOnEnd = true;
doc.OptionCheckSyntax = false;
doc.OptionFixNestedTags = true;
// etc, just set them before calling Load or LoadHtml

http://htmlagilitypack.codeplex.com/

string textToExtractSrcFrom = "... your text here ...";

doc.LoadHtml(textToExtractSrcFrom);

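// SelectNodes returns null when nothing matches, so fall back to an empty collection
var nodes = doc.DocumentNode.SelectNodes("//img[@src]") ?? new HtmlNodeCollection(doc.DocumentNode);
foreach (var node in nodes)
{
    string src = node.Attributes["src"].Value;
}

// or, with LINQ (requires "using System.Linq;")
var links = nodes.Select(node => node.Attributes["src"].Value);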
jessehouwing