I am using this regex to get all image URLs in an HTML file:
(?<=img\s*\S*src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])
Is there any way to modify this regex to exclude any img tags that are commented out with an HTML comment (<!-- ... -->)?
If your regex already works for extracting images (which would be a miracle in itself), consider a regex to strip HTML comments, like so:
<!--.*?-->
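Replace every match with an empty string, and any images inside comments will no longer show up in your other regex. Make sure the dot also matches newlines (for example RegexOptions.Singleline in .NET, or the s modifier in PHP), otherwise comments that span several lines are left in place.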
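As a rough sketch of that two-step approach, assuming you are on .NET (the named group syntax in your pattern suggests so, but that is a guess): strip the comments first, then run your original regex over what is left. The class name ImageUrlExtractor and the method Extract are made up for illustration. Bear in mind that a regex cannot tell a real comment from a "<!--" that happens to sit inside a script block or an attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ImageUrlExtractor
{
    // Matches an HTML comment; Singleline lets '.' match line breaks,
    // so comments spanning several lines are removed as well.
    private static readonly Regex Comment =
        new Regex(@"<!--.*?-->", RegexOptions.Singleline);

    // The pattern from the question, unchanged.
    private static readonly Regex ImgSrc =
        new Regex(@"(?<=img\s*\S*src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])");

    public static List<string> Extract(string html)
    {
        // Step 1: drop every comment, including any img tags inside it.
        string withoutComments = Comment.Replace(html, string.Empty);

        // Step 2: collect the src values from the remaining markup.
        var urls = new List<string>();
        foreach (Match m in ImgSrc.Matches(withoutComments))
        {
            urls.Add(m.Groups["Url"].Value);
        }
        return urls;
    }
}
```

Calling ImageUrlExtractor.Extract on markup with one img tag before a comment, one inside it, and one after it should return only the two URLs outside the comment.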
Alternatively, if you're using PHP (you didn't tag a programming language), you can use the strip_tags function with "<img>" as the "allowable tags" parameter. This will strip out HTML comments, as well as other tags that may interfere with your regex.
It's actually also very simple with the HTML Agility Pack. It parses comments as separate comment nodes, so commented-out img tags are never returned as elements. There are also a bunch of settings that help fix bad HTML if needed, for example:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionAutoCloseOnEnd = true;
doc.OptionCheckSyntax = false;
doc.OptionFixNestedTags = true;
// etc, just set them before calling Load or LoadHtml
Project page: http://htmlagilitypack.codeplex.com/
string textToExtractSrcFrom = "... your text here ...";
doc.LoadHtml(textToExtractSrcFrom);
// SelectNodes returns null when nothing matches, so fall back to an empty collection
var nodes = doc.DocumentNode.SelectNodes("//img[@src]") ?? new HtmlNodeCollection(doc.DocumentNode);
foreach (var node in nodes)
{
    string src = node.Attributes["src"].Value;
}
// or, with LINQ (requires using System.Linq)
var links = nodes.Select(node => node.Attributes["src"].Value);