Extract all image urls from html except for those commented out

Question

I am using this regex to get all image urls in an html file:

(?<=img\s*\S*src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])

Is there any way to modify this regex to exclude any img tags that are commented out with an HTML comment ("<!-- -->")?

[The pony he comes...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Niet the Dark Absol, Feb 24 '12 at 18:02
@Pekka: because I can't guarantee the html to be 100% "correct" - the app is getting it from non-IT personnel so there is a good chance of [badly] malformed html. — Andrey, Feb 24 '12 at 18:06

score 2 · Accepted Answer · answered Feb 24 '12 at 18:05

If your regex already works for extracting images (which would be a miracle in itself), consider a regex to strip HTML comments, like so:

<!--.*?-->

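Replace each match with an empty string, and any images inside a comment will no longer show up in your other regex. If comments can span several lines, enable your regex engine's dot-matches-newline mode so that . also matches line breaks.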
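A minimal C# sketch of this two-pass approach, assuming .NET's System.Text.RegularExpressions (the question's pattern uses .NET-style named groups and variable-length lookbehind). The sample html string is made up for illustration, and RegexOptions.Singleline is an addition so the comment pattern also covers comments that span several lines:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample input: two live images and one commented-out image.
string html = "<p><img src=\"a.png\"></p>\n<!--\n<img src='b.png'>\n-->\n<img src='c.png'>";

// Pass 1: strip HTML comments. Singleline makes '.' match newlines,
// so multi-line comments are removed as well.
string withoutComments = Regex.Replace(html, "<!--.*?-->", "", RegexOptions.Singleline);

// Pass 2: the regex from the question, unchanged.
var imgSrc = new Regex(@"(?<=img\s*\S*src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])");

string[] urls = imgSrc.Matches(withoutComments)
    .Cast<Match>()
    .Select(m => m.Groups["Url"].Value)
    .ToArray();

Console.WriteLine(string.Join(", ", urls)); // prints "a.png, c.png"
```

The sample keeps src as the first attribute of each img tag, since that is the shape the question's pattern is written for; the comment-stripping pass does not depend on that.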

Alternatively, if you're using PHP (you didn't tag a programming language), you can use the strip_tags function with "<img>" as the "allowable tags" parameter. This will strip out HTML comments, as well as other tags that may interfere with your regex.

And yes, the regex is already working for extracting image urls just fine. — Andrey, Feb 24 '12 at 18:11

score 0 · Answer 2 · answered Feb 24 '12 at 22:10

It's actually also very simple with the HTML Agility Pack. It has a number of settings that help fix bad HTML if needed, for example:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionAutoCloseOnEnd = true;
doc.OptionCheckSyntax = false;
doc.OptionFixNestedTags = true;
// etc, just set them before calling Load or LoadHtml

http://htmlagilitypack.codeplex.com/

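string textToExtractSrcFrom = "... your text here ...";

doc.LoadHtml(textToExtractSrcFrom);

// SelectNodes returns null when nothing matches, so fall back to an empty collection.
// Commented-out markup is parsed as comment nodes, so //img does not select it.
var nodes = doc.DocumentNode.SelectNodes("//img[@src]") ?? new HtmlNodeCollection(doc.DocumentNode);
foreach (var node in nodes)
{
    string src = node.Attributes["src"].Value;
}

// or, with LINQ (needs "using System.Linq;"):
var links = nodes.Select(node => node.Attributes["src"].Value);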
