A regular expression question

Question

I have content something like

<div class="c2">
<div class="c3">
<p>...</p>
</div>
</div>

What I want is to match the inner HTML of div.c2. Its contents may vary a lot. The only problem I am facing is how to make sure that the right closing div is matched.

“[…] how can I make it to work so that the right closing div is taken?” – That’s exactly what regular expressions can’t do. Use an HTML parser. — Gumbo, Sep 04 '10 at 15:30
Could you rephrase that question a bit? I don't really understand what you want to do. — Octavian Helm, Sep 04 '10 at 15:32
This has been asked so many times before please use the site search feature for reasons not to do this, or see my post below — Woot4Moo, Sep 04 '10 at 15:35
Stackoverflow should implement a new reason for closing questions: **Trying to parse HTML with regexp** — slebetman, Sep 04 '10 at 15:50

score 1 · Accepted Answer · answered Sep 04 '10 at 15:38

You can't. This problem is unsolvable with classic regular expressions, and with most of the existing regex implementations.
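To make the failure concrete, here is a small sketch in Python (the sample string and the extra sibling div are made up for illustration). A lazy pattern stops at the first closing div, and a greedy one runs to the last closing div in the whole input, so neither can pick the closing tag that actually pairs with div.c2:

```python
import re

# Illustrative input: the snippet from the question plus a made-up sibling div.
html = (
    '<div class="c2">\n'
    '<div class="c3">\n'
    '<p>...</p>\n'
    '</div>\n'
    '</div>\n'
    '<div class="other"></div>\n'
)

# Lazy: stops at the FIRST </div>, which closes div.c3, not div.c2.
lazy = re.search(r'<div class="c2">(.*?)</div>', html, re.DOTALL).group(1)

# Greedy: runs to the LAST </div> in the input, swallowing the sibling.
greedy = re.search(r'<div class="c2">(.*)</div>', html, re.DOTALL).group(1)

print(repr(lazy))
print(repr(greedy))
```

The lazy capture is cut off before div.c3 is closed, while the greedy capture contains the opening tag of the unrelated sibling.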

However, some regex engines have special support for balanced pair matching. See, e.g., balancing group definitions in .NET. Even in this case, though, your regex will be able to parse only a subset of syntactically correct texts (e.g., what if a </div> is embedded in a comment?). You need an HTML parser to get reliable results.
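As a rough illustration of the parser route, the sketch below uses `html.parser` from the Python standard library and counts nested div tags to find the matching close. The class name `InnerHTMLExtractor` and the helper `inner_html` are hypothetical names chosen for this example, and the output is re-serialized from parser events, so it is not guaranteed to be byte-for-byte identical to the source (doctype declarations and processing instructions, for instance, are dropped):

```python
from html.parser import HTMLParser


class InnerHTMLExtractor(HTMLParser):
    """Collects the inner HTML of the first <tag> carrying the given class."""

    def __init__(self, tag, cls):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.tag, self.cls = tag, cls
        self.depth = 0      # nesting depth of `tag` within the target; 0 = outside
        self.done = False   # True once the target's own closing tag was seen
        self.parts = []

    def _inside(self):
        return self.depth > 0 and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == self.tag and self.cls in classes:
                self.depth = 1  # entered the target; its own tag is not inner HTML
            return
        if tag == self.tag:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the nesting depth.
        if self._inside():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._inside():
            return
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # this is the closing tag that pairs with the target
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._inside():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._inside():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._inside():
            self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        # A </div> inside a comment arrives here as text, not as an end tag.
        if self._inside():
            self.parts.append(f"<!--{data}-->")


def inner_html(html, tag="div", cls="c2"):
    parser = InnerHTMLExtractor(tag, cls)
    parser.feed(html)
    parser.close()
    if parser.depth == 0 and not parser.done:
        return None  # target element never found
    return "".join(parser.parts)


# Illustrative input with a closing div hidden in a comment.
sample = (
    '<div class="c2">\n'
    '<div class="c3">\n'
    '<p>...</p>\n'
    '<!-- </div> -->\n'
    '</div>\n'
    '</div>\n'
    '<div class="other"></div>\n'
)
print(inner_html(sample))
```

The depth counter is what a classic regular expression lacks: it lets the parser tell the closing tag of div.c3 apart from that of div.c2, and the commented-out tag never reaches `handle_endtag`.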

atzz


You could handle comments with a regular expression implementation that supports recursive patterns too. – Gumbo Sep 04 '10 at 15:44
@Gumbo - hmm, probably... But what if source is not syntactically correct? Personally, I wouldn't be comfortable with a solution that has to explicitly take care of each possibility (what if I miss some?) I'd prefer a (maybe specialized, simplified) parser. – atzz Sep 04 '10 at 16:10

score 0 · Answer 2 · answered Sep 04 '10 at 15:40

Any chance this will always be valid XHTML? If so, you'd be better off parsing it as XML than trying to regex this.
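If the input really is well-formed XHTML, a minimal sketch with Python's `xml.etree.ElementTree` could look like the following. The helper name `inner_xml` is hypothetical, and the sketch assumes no XML namespace on the elements and no HTML-only entities such as `&nbsp;`, both of which would need extra handling:

```python
import xml.etree.ElementTree as ET


def inner_xml(xhtml, class_name):
    """Return the serialized children of the first div with the given class."""
    root = ET.fromstring(xhtml)  # raises ParseError if the input is not well-formed
    for elem in root.iter("div"):  # iter() also visits the root element itself
        if class_name in elem.get("class", "").split():
            parts = [elem.text or ""]  # text before the first child
            for child in elem:
                # tostring() includes the child's tail text as well
                parts.append(ET.tostring(child, encoding="unicode"))
            return "".join(parts)
    return None


sample = (
    '<div class="c2">\n'
    '<div class="c3">\n'
    '<p>...</p>\n'
    '</div>\n'
    '</div>'
)
print(inner_xml(sample, "c2"))
```

The XML parser does the tag matching itself, so the code only has to locate the element and serialize its children.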

mattmc3


score 0 · Answer 3 · answered Sep 04 '10 at 15:55

Delete the first line, delete the last line. Problem solved. No need for RegEx.

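The following pattern works well with the .NET regex implementation (the {...} tagged-expression and \1 replacement syntax appears to be that of the Visual Studio Find and Replace dialog):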

\<div class="c2"\>{[\n a-z.<>="0-9/]+}\</div\>

And we replace that with \1.

Input:

<div class="c2">
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>
</div>

Output:

<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>
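For comparison, here is a rough Python equivalent of the pattern above. It assumes that `{...}` in the original denotes a capture group referenced as `\1`, and that `\<` and `\>` are plain literal angle brackets. It reproduces the output shown, but only because the greedy character class runs to the last closing div in the input and every character of the input happens to be in the class:

```python
import re

# Input from the answer above: eight consecutive closing divs on the fourth line.
text = (
    '<div class="c2">\n'
    '<div class="c3">\n'
    '<p>...</p>\n'
    + '</div>' * 8 + '\n'
    '</div>'
)

# Assumed translation of the original pattern into Python's re syntax.
pattern = r'<div class="c2">([\n a-z.<>="0-9/]+)</div>'

result = re.sub(pattern, r'\1', text)
print(result.strip())
```

The match ends at the last closing div rather than the matching one, so any markup following div.c2 would be captured as well, and content with uppercase letters or other characters outside the class would not match at all.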
