REGEX Matching the whole HTML Document

Question

So, I'm still a REGEX dummy and have only been using regexes for the past 2 days. However, my problem seems odd, to me at least.

The following pattern correctly matches this string for me:

<td valign=3D\"top\">For:</td>(\\s)+(=)?(.|\r\n|\n)+<td>(([a-z]|[A-Z]|=|\\s)+)<br>

Original String (taken from the html document which is being fed to the regex as input):

<td valign=3D"top">For:</td>     =             <td>XXXXXX XXXXX<br>

and the matched string:

<td valign=3D"top">For:</td>     =             <td>XXXXXX XXXXX<br>

However for this string:

<td valign=3D"top">For:</td>                     <td>YYYYYYY=     YYYYY<br>

it matched the entire HTML document. I don't understand why this is happening, since after my (([a-z]|[A-Z]|=|\\s)+) group I specified that there should be a <br> tag.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Mike G, Jul 11 '12 at 13:24
1) What on earth is `valign=3D"top"`? 2) REGEX for parsing DOMs = bad idea unless you really have a good reason to be doing this. Better to use DOM methods on it to extract what you need. — Mitya, Jul 11 '12 at 13:29
@Utkanos valign=3D"top" ... don't ask me, I'm just parsing the HTML document, I didn't create it. OK, I get that angels will be weeping because I used regex to parse HTML, and I do know there are libraries like HTML Agility Pack for reading HTML, but I was building on older code that used regex to parse HTML documents. For consistency, I'm using regex to parse this document too. — Jonny, Jul 11 '12 at 13:37
besides, I never used any regexes, so it is still beneficial (I'm learning something new :)) — Jonny, Jul 11 '12 at 13:38
Learning REGEX is fine (and a very good idea), it's just that you wouldn't want to learn it via an HTML string. If the HTML is even slightly malformed or unpredictable (or, in your case, outright invalid), the REGEX will fail. REGEX isn't a parser. In any case, you really will have to address the invalid HTML issue. Your task is pretty much a non-starter until that's resolved. — Mitya, Jul 11 '12 at 13:40
The invalid HTML is being generated by another system, god knows where, so I can't really fix that. I'm using C# to read the HTML document, and unless I use a library I have to go through the document and parse it myself, which would take some time. — Jonny, Jul 11 '12 at 13:46
Your `(.|\r\n|\n)` group has a redundant alternative: the dot matches any character except `\n`, so it already consumes the `\r` and the `\r\n` branch is never needed. Also, you can simplify a lot of your `(x|y|z)` sections into character classes. For example, `([a-z]|[A-Z]|=|\\s)+` can be simplified to `[a-zA-Z=\\s]+`. Also, try to use the `*` quantifier instead of `+` when matching whitespace, especially in HTML. — Jason Larke, Jul 11 '12 at 14:34
HTML consists of nested structures. You cannot parse a nested structure with plain old regexps. — Ira Baxter, Jul 11 '12 at 14:52

Andrew Cheong · Accepted Answer · 2012-07-11T14:01:25.070

2

Add the indicated question marks for non-greedy matching:

<td valign=3D\"top\">For:</td>(\\s)+(=)?(.|\r\n|\n)+?<td>(([a-z]|[A-Z]|=|\\s)+?)<br>
                                                    ^                         ^
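
To see what the two added question marks change, here is a small sketch using Python's `re` module. That is an assumption on my part: the question's code is C#, but as far as I know greedy and lazy quantifiers backtrack the same way for these constructs in both engines. The second row of `doc` is invented for illustration and is not from the original document.

```python
import re

# Hypothetical two-row document. The first row is the string from the
# question; the "Ref:" row is made up so that there is a later <td>...<br>.
doc = (
    '<td valign=3D"top">For:</td>     =             <td>XXXXXX XXXXX<br>\n'
    '<td valign=3D"top">Ref:</td>     <td>ABC<br>\n'
)

# Original pattern: every quantifier is greedy.
greedy = re.compile(
    r'<td valign=3D"top">For:</td>(\s)+(=)?(.|\r\n|\n)+<td>(([a-z]|[A-Z]|=|\s)+)<br>'
)
# Same pattern with the two question marks added (lazy quantifiers).
lazy = re.compile(
    r'<td valign=3D"top">For:</td>(\s)+(=)?(.|\r\n|\n)+?<td>(([a-z]|[A-Z]|=|\s)+?)<br>'
)

greedy_match = greedy.search(doc)
lazy_match = lazy.search(doc)

# Group 4 is the captured cell text in both patterns.
print("greedy captured:", repr(greedy_match.group(4)))
print("lazy captured:  ", repr(lazy_match.group(4)))
```

The greedy `(.|\r\n|\n)+` first runs to the end of the input and then backtracks to the last place where the rest of the pattern fits, so it lands on the final row. The lazy version grows one character at a time and stops at the first place where the rest fits.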

EDIT:

Further, you can simplify into a character class instead of using alternation:

<td valign=3D\"top\">For:</td>(\\s)+(=)?(.|[\r\n])+?<td>([a-zA-Z=\\s]+?)<br>
                                           ^^^^^^        ^^^^^^^^^^^^
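
A quick sanity check that the character class accepts the same text as the alternation, again sketched with Python's `re` (an assumption; the question uses C#). The sample strings are the placeholder values from the question plus one invented string with digits. The last two lines show why the `[\r\n]` branch next to the dot still matters: without a single-line option, the dot does not match `\n`.

```python
import re

# The alternation from the question and the equivalent character class.
alternation = re.compile(r'(([a-z]|[A-Z]|=|\s)+)')
char_class = re.compile(r'([a-zA-Z=\s]+)')

# 'abc123' is an invented sample; both patterns should stop before the digits.
samples = ['XXXXXX XXXXX', 'YYYYYYY=     YYYYY', 'abc123']
results = [
    (alternation.match(s).group(1), char_class.match(s).group(1))
    for s in samples
]
for alt_text, class_text in results:
    print(repr(alt_text), repr(class_text))

# The dot alone cannot cross a line feed; adding [\r\n] lets it.
dot_only = re.fullmatch(r'.+', 'a\r\nb')
dot_or_newline = re.fullmatch(r'(.|[\r\n])+', 'a\r\nb')
print(dot_only, dot_or_newline is not None)
```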

My only question is why your \\s is escaped while your \r\n are not...

EDIT 2:

Use * instead of + where, for example, spaces aren't mandatory; and non-greedy quantifiers are probably always helpful in these cases:

<td valign=3D\"top\">For:</td>(\\s)*?(=)?(.|[\r\n])*?<td>([a-zA-Z=\\s]*?)<br>
                                   ^^       ------ ^-     ------------^-
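
Here is a sketch of why the `+` quantifiers still overshoot on the second string from the question, written with Python's `re` as a stand-in for the C# engine (an assumption). The "Ref:" row is invented so that a later `<td>...<br>` exists to overshoot to.

```python
import re

# First row is the failing string from the question; second row is made up.
expected = 'YYYYYYY=     YYYYY'
doc = (
    '<td valign=3D"top">For:</td>                     <td>' + expected + '<br>\n'
    '<td valign=3D"top">Ref:</td>     <td>ABC<br>\n'
)

# Lazy, but still "one or more" everywhere (the pattern from the first EDIT).
plus_lazy = re.compile(
    r'<td valign=3D"top">For:</td>(\s)+(=)?(.|[\r\n])+?<td>([a-zA-Z=\s]+?)<br>'
)
# Lazy and "zero or more" (the pattern from EDIT 2).
star_lazy = re.compile(
    r'<td valign=3D"top">For:</td>(\s)*?(=)?(.|[\r\n])*?<td>([a-zA-Z=\s]*?)<br>'
)

plus_match = plus_lazy.search(doc)
star_match = star_lazy.search(doc)

# Group 4 is the captured cell text in both patterns.
print("with +:", repr(plus_match.group(4)))
print("with *:", repr(star_match.group(4)))
```

With `+`, the greedy `(\s)+` swallows all the spaces, there is no `=`, and `(.|[\r\n])+?` must still consume at least one character. That character is the `<` of the wanted `<td>`, so the engine keeps scanning forward until some later `<td>` fits. With `*`, the middle group is allowed to match nothing, so the first `<td>` is found.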

edited Jul 11 '12 at 14:01

answered Jul 11 '12 at 13:44

Andrew Cheong


Normally I'd post The Pony Is Coming but you've justified that you're just trying to build on other code. Promise us that you would not have used regex if it were your choice from the beginning, and I think we'll get off your back. ;) – Andrew Cheong Jul 11 '12 at 13:45
lol I promise, having said that it still matched the whole document :( – Jonny Jul 11 '12 at 13:49
Still it didn't work :( ... As for the escaping: \s is the whitespace class, according to what I've read at least, so I need the \\ to escape the \ in the C# string and feed the regex engine \s. With just \s the compiler complains that \s is not a valid escape sequence. – Jonny Jul 11 '12 at 14:00
It's actually because you are using the `+` quantifier where `*` would be more appropriate. By using `+`, you require "at least one" in cases where there may be "none." See EDIT 2. – Andrew Cheong Jul 11 '12 at 14:01

score 1 · Answer 2 · edited May 23 '17 at 10:09


Parsing HTML with regexes is a very bad idea.

See why here: RegEx match open tags except XHTML self-contained tags

Even for parsing very simple things in HTML, using a DOM parser is generally cleaner (more readable) and less error-prone, even more so if you are new to regexes.
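
As a rough illustration of the parser route, here is a sketch using Python's standard-library `html.parser`. Several things here are assumptions and not from the question: the language (the asker uses C#, where something like the HTML Agility Pack mentioned in the comments would play this role), the class and function names, and the guess that `=3D` and the stray `=` before whitespace are quoted-printable encoding, which is why the input is decoded with `quopri` first. Note that `html.parser` is an event-driven tokenizer, not a full DOM parser.

```python
import quopri
from html.parser import HTMLParser


class CellAfterLabel(HTMLParser):
    """Capture the text of the first <td> after a <td> whose text is `label`."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.value = None          # captured cell text, once found
        self._in_td = False
        self._seen_label = False
        self._capturing = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._buf = []
            # Start capturing only in the first cell after the label cell.
            self._capturing = self._seen_label and self.value is None
        elif tag == "br" and self._capturing:
            self._finish()         # the sample cell ends at <br>, not </td>

    def handle_endtag(self, tag):
        if tag != "td":
            return
        if self._capturing:
            self._finish()
        elif "".join(self._buf).strip() == self.label:
            self._seen_label = True
        self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)

    def _finish(self):
        # Collapse runs of whitespace left over from line wrapping.
        self.value = " ".join("".join(self._buf).split())
        self._capturing = False


def extract_cell_after(raw, label="For:"):
    """Decode quoted-printable bytes (assumed encoding) and return the cell text."""
    html = quopri.decodestring(raw).decode("ascii", errors="replace")
    parser = CellAfterLabel(label)
    parser.feed(html)
    parser.close()
    return parser.value


# Hypothetical input modeled on the question, with "=\r\n" soft line breaks.
sample = (
    b'<td valign=3D"top">For:</td>     =\r\n'
    b'        <td>YYYYYYY=\r\n     YYYYY<br>'
)
cell = extract_cell_after(sample)
print(repr(cell))
```

The parser does not care how much whitespace sits between the two cells or whether an `=` appears there, which is exactly what the regex variants above had to be tuned for.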


answered Jul 11 '12 at 13:43

Filipe Palrinhas

