Regular expression to parse images from Html

Question

I was going over a piece of code, and I came across this regular expression:

Regex _fileOrImageRegex = new Regex("<\\s*(?<Tag>(applet|embed|frame|iframe|img|link|script|xml))\\s*.*?(?<AttributeName>(src|href|xhref))\\s*=\\s*([\\\"\\'])(?<FileOrImage>.*?)\\3", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

Can someone please explain the expression to me in plain words? It's being used to parse all the images; I get that part. I also want to modify the regular expression to include the alt attribute for every image tag it matches.

Thanks

score 1 · Answer 1 · answered Feb 19 '12 at 03:34

You should be using a DOM or XPath library to process [X]HTML; using a regular expression for this sort of thing can be very fragile.

jturmel

DOM and XPath may be good for processing HTML. But some HTML content may not contain end tags, as end tags are not mandatory in HTML. When we try to load that HTML into an XDocument we get an error. In those cases, regex can be a good solution. – Alex Hn. Aug 31 '12 at 15:55
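
To illustrate the point in the comment above, here is a minimal sketch, written as C# top-level statements, of what happens when HTML with an unclosed tag is handed to XDocument. The sample markup is made up for this example; it is ordinary HTML but not well-formed XML.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical sample: fine as HTML, but the img tag is never closed.
string html = "<div><img src='a.png'></div>";

bool parsed;
try
{
    XDocument.Parse(html);
    parsed = true;
}
catch (XmlException ex)
{
    // The XML parser rejects the input because the tags do not nest properly.
    Console.WriteLine(ex.Message);
    parsed = false;
}

Console.WriteLine(parsed ? "Parsed as XML" : "Not well-formed XML");
```

An HTML-aware parser such as the HTML Agility Pack is built to tolerate this kind of markup, which is why the other answers recommend it over both XDocument and regex.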

score 1 · Accepted Answer · edited May 23 '17 at 12:03

Required link: RegEx match open tags except XHTML self-contained tags

In English, what it does is this:

< matches the opening angle bracket of an HTML tag

\s* matches any amount of whitespace (tabs, spaces, newlines)

(?<Tag> opens a named group: whatever the group matches is stored under the name Tag

The next lump is the possible values for open tags - applet, embed, etc

The () around the values mean "store this value in a subpattern, and make it available as part of my results"

The | means "or", so applet or embed, etc - this looks at tag names

\s* more whitespace

.*? means "as little of anything as possible". Normally . does not match newlines, but because of the Singleline flag (see comments for this answer) it matches newlines too

(?<AttributeName> is another named group, see above, this time around the alternatives (src, href, xhref) - these are the tag attributes

\s*=\s* means "any amount of whitespace, followed by an equals sign, followed by any amount of whitespace"

([\"\']) the (), see above. The [] mean "any one of these characters", and the \" and \' are the " and ' characters, escaped with backslashes (inside the C# string literal each backslash and the " are escaped once more, hence [\\\"\\'] in the source)

(?<FileOrImage>.*?) is a named group again, and the .*? means "as few characters as possible", so it captures everything up to the closing quote

\3 matches whatever the third unnamed group captured, which here is the opening quote from ([\"\']), so the value has to end with the same kind of quote it started with

The options at the end are modifiers, they make the regex match more things - IgnoreCase makes it case insensitive, Singleline makes . match newlines as well, and someone else will tell you what Compiled means, because I don't know the language the regex is written for :)

Edit: You've just updated the first post a little. The <Tag> and <AttributeName> give the match groups a name, so for example, your result of running the regex might look like this:

Array
- Tag = img
- AttributeName = src
- FileOrImage = http://www.mysite.com/a.png
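
In C#, reading those named groups could look roughly like the sketch below (written as top-level statements). The pattern is the one from the question, written as a verbatim string and without the Compiled option for brevity; the sample HTML and the variable names are assumptions made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as in the question, as a verbatim string ("" is an escaped quote).
var fileOrImageRegex = new Regex(
    @"<\s*(?<Tag>(applet|embed|frame|iframe|img|link|script|xml))\s*.*?(?<AttributeName>(src|href|xhref))\s*=\s*([""'])(?<FileOrImage>.*?)\3",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

// Hypothetical input. The <a> tag is not in the tag list, so it is skipped.
string html = "<p>Hi</p><img src=\"http://www.mysite.com/a.png\" alt=\"A\"><a href='x.html'>x</a>";

MatchCollection matches = fileOrImageRegex.Matches(html);
foreach (Match m in matches)
{
    // Named groups are read by name; the quote character is unnamed group 3.
    Console.WriteLine(
        $"{m.Groups["Tag"].Value} {m.Groups["AttributeName"].Value} {m.Groups["FileOrImage"].Value}");
}
```

For this input the loop prints a single line, img src http://www.mysite.com/a.png. Note that nothing in the pattern reads the alt attribute, which is the limitation discussed in the comments.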

By the way, congratulations on having an awesome name :D

answered Feb 19 '12 at 03:35

Joe

Thanks for the breakdown. How can I modify the pattern to match both the src and alt attributes? – Joe Feb 19 '12 at 05:19
@joe (the answerer not the questioner) [SingleLine](http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx) makes `.` match newlines, so your description of `.?` is incorrect due to that flag. – Scott Chamberlain Feb 19 '12 at 05:37
@Joe asker: To add more tags, make `(applet|embed|frame|iframe|img|link|script|xml)` say (for example): `(applet|embed|frame|iframe|img|link|script|xml|a|b|u|i)`. To match more attributes, modify that one in a similar way: `(src|href|xhref)` becomes `(src|href|xhref|style|class|id)` – Joe Feb 19 '12 at 13:38
Compiled means that the .NET regular expression library will parse the expression, then convert it to Intermediate Language, which is in turn JIT-compiled to native code. That way it executes much faster after compilation, but the initial compilation stage is very expensive and the compiled expression should thus be cached somewhere. See also: http://fxcopcontrib.codeplex.com/wikipage?title=ConsiderMakingRegexReadOnlyAndCompiled&referringTitle=Documentation – jessehouwing Feb 19 '12 at 13:41
@Joe, I actually modified the expression to include the alt attribute, (?<AttributeName>(src|href|xhref|alt)), but it didn't work. I want to get both the src and the alt attribute of each matched image tag. – Joe Feb 19 '12 at 16:02
Adding alt to the alternation won't work, because each match would then find only the src or only the alt attribute, never both. That is why you'd want to use a DOM tree like the HTML Agility Pack instead of a regex. Though, with enough patience and virtue you could get a regex that works. Using a look-ahead might do the trick, something like `(?=([^>]*(alt\s*=\s*(?['"])(?((?!\k'altquote').)*))\k'altquote')?)` at the start of your expression. – jessehouwing Feb 19 '12 at 22:04
Joe, for that kind of thing, read the Required Link at the top of the answer - as soon as you start needing any kind of "if it's this tag, do something, or that tag, do something", you don't want regex. – Joe Feb 20 '12 at 03:44
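
The look-ahead idea from the comments can be sketched as follows. This is not the commenter's exact expression: it is a simplified pattern for img tags only, and the group names (Alt, AltQuote, Src, SrcQuote) and the sample HTML are assumptions made up for this example. It also assumes that attribute values contain no > character, so treat it as an illustration rather than a robust parser.

```csharp
using System;
using System.Text.RegularExpressions;

var imgRegex = new Regex(
    @"<\s*img
      # Look ahead inside the same tag for an optional alt attribute.
      (?=(?:[^>]*?\salt\s*=\s*(?<AltQuote>[""'])(?<Alt>.*?)\k<AltQuote>)?)
      # Then consume the tag up to and including the src attribute.
      [^>]*?\ssrc\s*=\s*(?<SrcQuote>[""'])(?<Src>.*?)\k<SrcQuote>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

// Hypothetical input: alt after src, alt before src, and no alt at all.
string html = "<img src=\"a.png\" alt=\"Logo\"><img alt='Second' src='b.png'><img src=\"c.png\">";

MatchCollection imgs = imgRegex.Matches(html);
foreach (Match m in imgs)
{
    // Alt is an empty string when the tag has no alt attribute.
    Console.WriteLine($"src={m.Groups["Src"].Value} alt={m.Groups["Alt"].Value}");
}
```

Because the look-ahead consumes no input, the alt attribute is found whether it appears before or after src, and the optional group inside it lets tags without alt still match. As the comments point out, a DOM parser remains the more readable choice once the requirements grow beyond this.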

score 1 · Answer 3 · edited May 23 '17 at 12:27

This is C# specific, but to add to Joe's answer to Joe's question: for readability, this regular expression could use @, the verbatim string prefix, so that the \ escapes are ignored by the string literal and passed uncorrupted (he comes!) to the regex. You can also use IgnorePatternWhitespace, which lets you break the pattern into semantic chunks across multiple lines:

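var fileOrImageRegex = new Regex(
    @"<\s*                                                     # opening angle bracket
    (?<Tag>(applet|embed|frame|iframe|img|link|script|xml))    # tag name
    \s*.*?                                                     # anything up to the attribute, lazy
    (?<AttributeName>(src|href|xhref))                         # attribute name
    \s*=\s*([""'])                                             # equals sign and opening quote, group 3
    (?<FileOrImage>.*?)                                        # attribute value, lazy
    \3                                                         # the same quote character again",
    // Singleline is kept so that . still matches newlines, as in the original expression.
    RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);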
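
As a quick check of the verbatim-string point, the small sketch below compares one fragment of the pattern written both ways. The variable names are made up for this example.

```csharp
using System;

// Regular string literal: every backslash and double quote must be escaped.
string escaped = "\\s*=\\s*([\\\"\\'])";

// Verbatim literal: backslashes are literal, and "" stands for one double quote.
string verbatim = @"\s*=\s*([\""\'])";

// Both literals produce the same characters, so the regex sees the same pattern.
bool same = escaped == verbatim;
Console.WriteLine(same);
```

Both literals yield the text \s*=\s*([\"\']), so switching to the verbatim form changes only how the pattern reads in source code, not what it matches.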

score 0 · Answer 4 · answered Feb 19 '12 at 13:53

I couldn't resist creating a version of this solution using the HTML Agility Pack:

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHtmlString); // or use doc.Load(string path)

// Select every element that carries at least one of the URL attributes.
var nodes = doc.DocumentNode.SelectNodes("//*[@href or @xhref or @src]");
if (nodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in nodes)
    {
        // optionally use interestingTags.Contains(node.Name) to only look in specific tags
        string url = string.Empty;
        string alt = string.Empty;

        // If several URL attributes are present, the last one checked wins.
        if (node.Attributes.Contains("href"))
            url = node.Attributes["href"].Value;
        if (node.Attributes.Contains("xhref"))
            url = node.Attributes["xhref"].Value;
        if (node.Attributes.Contains("src"))
            url = node.Attributes["src"].Value;

        if (node.Attributes.Contains("alt"))
            alt = node.Attributes["alt"].Value;

        // So I found a node, what to do with it...
        FoundNode(url, alt);
    }
}

I am familiar with and have worked with the HTML Agility Pack, but as I mentioned before, this is not my code and I am trying to understand it. – Joe Feb 19 '12 at 18:14
That I understand, and I applaud you for that. On the other hand, if you need to expand this beast with the ability to also find optional alt attributes, then you might want to consider using a more suitable method (read: DOM parsing) instead; the code is so much more readable. – jessehouwing Feb 19 '12 at 22:09
Unfortunately modifying the code base is not an option at this time. Thanks for your good intentions though :) If I could, I would. – Joe Feb 20 '12 at 14:52
