I'm trying to write a regular expression that will match all image tags apart from the first in an HTML file. E.g.:

<html><body><img src="foo"><span><img src="bar"></span><img src="foobar"></body></html>

So far I've only managed to create an expression that matches all of the image tags:

<img[^>]*>
Jim
  • Why not just match them all and skip the first result in your code? – Scott Chamberlain Feb 17 '15 at 18:01
  • Don't use regex to parse HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – pm100 Feb 17 '15 at 18:02
  • You shouldn't use regex against HTML. You should be using a library that understands HTML, such as the [HTML Agility Pack](http://htmlagilitypack.codeplex.com/). – mason Feb 17 '15 at 18:02
  • @ScottChamberlain I might do something like that. What I wanted to do was delete all of the img tags apart from the first one. It still doesn't look straightforward with this approach. – Jim Feb 17 '15 at 18:16
  • @Jim Any purely regex solution will be quite complicated and perform very poorly compared to just skipping the first match. – Scott Chamberlain Feb 17 '15 at 19:05
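
The "match them all and skip the first" suggestion from the comments, applied to the stated goal of deleting every img tag except the first, might look roughly like the sketch below. It is written in Python rather than C#, the helper name `drop_all_but_first_img` is made up for illustration, and it reuses the pattern from the question, so it inherits that pattern's limitations.

```python
import re

IMG_TAG = re.compile(r"<img[^>]*>")  # the pattern from the question


def drop_all_but_first_img(html):
    """Remove every match of IMG_TAG except the first one."""
    seen = 0

    def keep_first(match):
        nonlocal seen
        seen += 1
        # Keep the first match unchanged; replace later ones with nothing.
        return match.group(0) if seen == 1 else ""

    return IMG_TAG.sub(keep_first, html)


html = '<html><body><img src="foo"><span><img src="bar"></span><img src="foobar"></body></html>'
print(drop_all_but_first_img(html))
```

The replacement callback sees the matches in document order, so a simple counter is enough to tell the first tag from the rest.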

2 Answers

Just use a real HTML parser like HtmlAgilityPack to parse the HTML:

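// Requires the HtmlAgilityPack package and "using System.Linq;"
var html = @"<html><body><img src=""foo""><span><img src=""bar""></span><img src=""foobar""></body></html>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// The src attribute of every <img> element except the first, in document order
var imgLinks = doc.DocumentNode
                    .Descendants("img")
                    .Skip(1)
                    .Select(x => x.Attributes["src"])
                    .ToList();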
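
For comparison, the same parse-then-skip-the-first idea can be sketched with a standard-library parser. This is Python's `html.parser`, not HtmlAgilityPack, and the class name `ImgSrcCollector` is only illustrative.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        # Self-closing <img .../> tags also end up here by default.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


html = '<html><body><img src="foo"><span><img src="bar"></span><img src="foobar"></body></html>'
collector = ImgSrcCollector()
collector.feed(html)
collector.close()

later_sources = collector.sources[1:]  # skip the first image
print(later_sources)
```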

Don't do this:

var pattern = @"<img[^>]*>"; // the pattern from your question
var imgs = Regex.Matches(html, pattern)
                .Cast<Match>()
                .Skip(1)
                .Select(m => m.Value)
                .ToList();
EZI
  • Why not? An HTML parser is far more complex and heavyweight than a simple regex match. The structure of a tag can be parsed this way, with no need to parse the whole document. Does he also need the whole structure if he only wants the images? – Luis Colorado Feb 18 '15 at 16:31
  • @LuisColorado http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – EZI Feb 18 '15 at 16:37
  • That reference has been closed because of a great amount of off-topic discussion, and this one will be closed too if we bring all that discussion here. There's no need to do a complete parse of an HTML document just to find an `<img>` tag. The regexp is a valid one, and XML and HTML tags can be defined with a regular grammar, so there is no need for a whole parser here (or there). – Luis Colorado Feb 18 '15 at 16:46
  • @LuisColorado a) You are dreaming, it is not closed as *off-topic*. b) *You can't parse html with regex.* Just read top 3 of most upvoted answers. – EZI Feb 18 '15 at 16:51
  • It's true that you can't parse **HTML** with a regex... but you can find an `<img>` tag in it with a regex. You cannot match opening tags with their proper closing ones, but you can find and extract all the tags. The grammar for a single HTML tag **is regular** and, as such, it can be parsed with a DFA (or a regex) without difficulty. You cannot parse a Pascal program with a regex, but you can find every numeric literal in a Pascal program with a regex. The grammar for an HTML tag is a regular subset of the general HTML grammar. – Luis Colorado Feb 18 '15 at 17:10
  • @LuisColorado So you say you can parse this tag `` – EZI Feb 18 '15 at 17:24
  • Yes... you can... you had to put the > between quotes... In the same way, you can put a /* */ delimited string inside " quotes in C and it is interpreted as a string, without being confused with a comment. I can write the regexp for a tag in XML, but I won't do it here. Of course, the regex is not the one used above, but you can do the work with one regex. Final comment. – Luis Colorado Feb 18 '15 at 17:27
  • @LuisColorado Just write it. And I can post another case your regex will fail. Final comment. – EZI Feb 18 '15 at 17:30

In this answer I'm going to demonstrate that tags can be matched with a regex, contrary to the belief expressed in some comments that a tag can only be identified with a complete HTML/XML parser.

For the demonstration I shall use a subset of the XML grammar rules from the XML 1.1 specification, available at www.w3.org, taking every rule reachable from STag and EmptyElemTag, which are the tags we want to match. Since none of these rules is recursive, this set of rules can be converted to a regexp that parses start tags and empty-element tags.

As XML uses Unicode character encodings and allows characters beyond the range \u0000-\uffff, I need a notation for character classes above that range. I shall use a nonstandard extension of the \u notation with five hex digits instead of four, which simplifies this grammar-to-regexp conversion (it covers the allowed characters in the range 0x10000-0xeffff).

Borrowed from the xml specification for XML version 1.1 is the syntax for the start and empty element tags:

STag ::= '<' Name (S Attribute)* S? '>'
EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
Name ::= (NameStartChar NameChar*)
NameChar ::= (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])
NameStartChar ::= ([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])
S ::= ([\u00020\u00009\u0000d\u0000a]+)
Attribute ::= (Name Eq AttValue)
Eq ::= (S? '=' S?)
AttValue ::= ( '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'" )
Reference ::= (EntityRef | CharRef)
EntityRef ::= ('&' Name ';')
CharRef ::= ('&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')

To construct the regular expression that accepts start tags and empty-element tags, I begin with the grammar above and add a single start rule that accepts either kind of tag:

Start ::= STag | EmptyElemTag

Then I substitute each nonterminal with the (properly parenthesized) right-hand side of its rule, until only terminals and regexp operators remain on the right-hand side:

Start ::= '<' Name (S Attribute)* S? '>' | '<' Name (S Attribute)* S? '/>'

I can do some operations to group terms and get

Start ::= '<' Name (S Attribute)* S? '/'?'>'

Now substitute Attribute

Start ::= '<' Name (S Name Eq AttValue)* S? '/'? '>'

Now substitute AttValue

Start ::= '<' Name (S Name Eq ('"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'" ))* S? '/'? '>'

Now substitute Reference

Start ::= '<' Name (S Name Eq ('"' ([^<&"] | EntityRef | CharRef)* '"' | "'" ([^<&'] | EntityRef | CharRef)* "'" ))* S? '/'? '>'

Now substitute EntityRef

Start ::= '<' Name (S Name Eq ('"' ([^<&"] | '&' Name ';' | CharRef)* '"' | "'" ([^<&'] | '&' Name ';' | CharRef)* "'" ))* S? '/'? '>'

Now substitute CharRef

Start ::= '<' Name (S Name Eq ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* S? '/'? '>'

Now Eq

Start ::= '<' Name (S Name S? '=' S? ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* S? '/'? '>'

Next S

Start ::= '<' Name (([\u00020\u00009\u0000d\u0000a]+) Name ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'

Now substitute Name

Start ::= '<' (NameStartChar NameChar*) (([\u00020\u00009\u0000d\u0000a]+) (NameStartChar NameChar*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' (NameStartChar NameChar*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (NameStartChar NameChar*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'

Now substitute NameChar

Start ::= '<' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) (([\u00020\u00009\u0000d\u0000a]+) (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'

And last NameStartChar

Start ::= '<' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) (([\u00020\u00009\u0000d\u0000a]+) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? 
('"' ([^<&"] | '&' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'
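
The manual substitution steps above can also be mechanized. The sketch below is a simplified illustration in Python, not the exact expression derived in this answer: the rules are written directly as regex fragments, `{X}` is a made-up marker for a nonterminal, and names are restricted to ASCII (an assumption made only to keep the character classes short).

```python
import re

# Grammar rules as regex fragments; {X} marks a nonterminal (hypothetical notation).
# Assumption: NameStartChar/NameChar are cut down to ASCII for brevity.
RULES = {
    "Start": "{STag}|{EmptyElemTag}",
    "STag": "<{Name}({S}{Attribute})*{S}?>",
    "EmptyElemTag": "<{Name}({S}{Attribute})*{S}?/>",
    "Name": "{NameStartChar}{NameChar}*",
    "NameChar": "{NameStartChar}|[-.0-9]",
    "NameStartChar": "[:A-Za-z_]",
    "S": "[ \\t\\r\\n]+",
    "Attribute": "{Name}{Eq}{AttValue}",
    "Eq": "{S}?={S}?",
    "AttValue": "\"([^<&\"]|{Reference})*\"|'([^<&']|{Reference})*'",
    "Reference": "{EntityRef}|{CharRef}",
    "EntityRef": "&{Name};",
    "CharRef": "&#[0-9]+;|&#x[0-9a-fA-F]+;",
}

NONTERMINAL = re.compile(r"\{(\w+)\}")


def expand(rule):
    """Substitute nonterminals by their parenthesized right-hand sides until none remain."""
    pattern = RULES[rule]
    # Terminates because no rule refers back to itself, directly or indirectly.
    while NONTERMINAL.search(pattern):
        pattern = NONTERMINAL.sub(lambda m: "(?:" + RULES[m.group(1)] + ")", pattern)
    return pattern


tag_re = re.compile(expand("Start"))
print(bool(tag_re.fullmatch('<img src="a>b" alt=\'x\'/>')))
```

Wrapping every substituted right-hand side in a non-capturing group plays the role of the "properly parenthesized" requirement: without it, an alternation such as the one in NameChar would leak into the surrounding rule.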

Finally, after replacing each quoted terminal 'c' with c and removing the unwanted blank spaces, the resulting regex is:

<(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*)(([\u00020\u00009\u0000d\u0000a]+)(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*)([\u00020\u00009\u0000d\u0000a]+)?=([\u00020\u00009\u0000d\u0000a]+)?(\"([^<&\"]|&(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*);|&#[0-9]+;|&#x[0-9a-fA-F]+;)*\"|\'([^<&\']|&(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*);|&#[0-9]+;|&#x[0-9a-fA-F]+;)*\'))*([\u00020\u00009\u0000d\u0000a]+)?/?>
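
Because the five-hex-digit \u notation is nonstandard, the expression above will probably not compile unchanged in most engines. As one possible adaptation, Python's `re` module accepts eight-digit `\UXXXXXXXX` escapes, so the notation can be rewritten mechanically. The sketch below does this for the NameStartChar class only; `to_python_escapes` is a hypothetical helper, not part of any library.

```python
import re

# NameStartChar exactly as written in the grammar above (five-digit \u notation).
NAME_START_CHAR = (
    r"[:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d"
    r"\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff"
    r"\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]"
)


def to_python_escapes(pattern):
    """Rewrite \\uXXXXX (five hex digits) as \\U000XXXXX, which Python's re understands."""
    return re.sub(r"\\u([0-9a-fA-F]{5})", r"\\U000\1", pattern)


name_start = re.compile(to_python_escapes(NAME_START_CHAR))

print(bool(name_start.fullmatch("\u00e9")))      # U+00E9, inside \u000d8-\u000f6
print(bool(name_start.fullmatch("\U00010400")))  # a character above U+FFFF
print(bool(name_start.fullmatch("\u00d7")))      # U+00D7, excluded by the grammar
```

Other engines have their own spelling for code points above U+FFFF, so the same rewrite would need a different target syntax there.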

Of course, other regexps can also match a start/empty tag, but this is one of the simplest I have been able to develop that copes with the scenarios pointed out in the comments.

A simpler one could be:

<[iI][mM][gG][ \t\n\r]+([^>"']|"[^"]*"|'[^']*')*>

This works if you are not dealing with characters outside the ASCII range \u0000-\u007f and you know the HTML file is valid. (This last one may be wrong, so use it with care: I constructed it in my head, and it may mishandle some odd cases.)
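
A quick way to try the simpler expression is to run it over a sample that puts a > inside a quoted attribute value. The sketch below does that in Python; the sample markup and file names are invented for illustration, and the pattern from the question (made case-insensitive here) is included only for contrast.

```python
import re

# The simpler expression from this answer, unchanged.
SIMPLE_IMG = re.compile(r"""<[iI][mM][gG][ \t\n\r]+([^>"']|"[^"]*"|'[^']*')*>""")
# The pattern from the question, made case-insensitive for a fair comparison.
NAIVE_IMG = re.compile(r"<img[^>]*>", re.IGNORECASE)

html = ('<p><img src="first.png">'
        '<IMG alt="a > b" src=\'second.png\'>'
        '<img src="third.png"/></p>')

tags = [m.group(0) for m in SIMPLE_IMG.finditer(html)]
rest = tags[1:]  # every image tag apart from the first
print(rest)

# The naive pattern stops at the > inside the quoted alt value.
naive_tags = NAIVE_IMG.findall(html)
print(naive_tags[1])
```

The quoted-string alternatives are what let the expression step over a > inside an attribute value; the naive pattern truncates the second tag at that point.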

Luis Colorado
  • If someone needs only the `<img ...>` and `<img .../>` tags, just replace the initial Name nonterminal with the fixed string `img`. I think this explanation is not needed, but anyway. – Luis Colorado Feb 18 '15 at 20:09
  • I must admit I find this pretty hard to understand. Should the final expression match anything if I run it against some HTML, or do I need to edit it? The simpler one does match my img tags. – Jim Feb 19 '15 at 10:51
  • The final regexp might work without modification. Just try it. It may have to be adapted to your environment, as regexp syntax differs between Perl, PHP, Java, etc. The last part is there to avoid treating a > inside an attribute value as the end of the tag, since values can be delimited with either single or double quotes. I don't know the exact rules for adapting it to each language. I just use them in vim(1) or C/C++ code. – Luis Colorado Feb 19 '15 at 16:24