-1

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I am having trouble parsing a tag using Java.

Goal:

My goal is to extract a complete div tag with all of its contents, even if it contains nested tags.

For example, from this HTML:

<h2>some random text</h2>
<div id="outerDiv">
  some text
      <div>
          some more text
      </div>
  last text
</div>
<div> some random div <b>bold</b></div>

I want to extract the outer div with all of its inner contents, up to its closing tag, that is:

<div id="outerDiv">
      some text
          <div>
              some more text
          </div>
      last text
    </div>

But what I currently get is either in this form or in some other random format (depending on the changes I try with my expression :) ).

Please help me improve my regex so that it correctly extracts a div with a specific id, along with its contents.

Here is my expression (a lot of brackets, just to be on the safe side :) ):

((<div.*(class=\"afs\")(.)*?>)((.)*?)(((<div(.)*?>)((.)*?)((</div>){1}))*?)((</div>){1}))

Here is my Java code:

package rexp;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Rexp {

    public static void main(String[] args) {

        CharSequence inputStr = "asdasd<div class=\"af\">sasa<div><div><div class=\"afs\">as</div>qwessa</div></div></div>asd";


        Pattern pattern = Pattern.compile("((<div.*(class=\"afs\")(.)*?>)((.)*?)(((<div(.)*?>)((.)*?)((</div>){1}))*?)((</div>){1}))");
        Matcher matcher = null;
        matcher = pattern.matcher(inputStr);

        if (matcher.find()) {
            System.out.println("Matched "+matcher.group(1));
        } else {
            System.out.println("Not Matched");
        }
    }
}
Aqif Hamid
  • 3
    One obvious answer ~ http://stackoverflow.com/a/1732454/89391. – miku Dec 04 '11 at 23:24
  • 3
    Why don't you do yourself a favor and use a proper parser? Regular expressions are not suitable for parsing HTML. – thkala Dec 04 '11 at 23:27
  • If OP's only need is to parse one tag, I think regexes are perfectly acceptable here, and they CAN do the job depending on the circumstances. – Bryan Dec 04 '11 at 23:30
  • 1
    Regexes can't match closing tags to opening tags, which the OP has asked for. – kdgregory Dec 04 '11 at 23:36
  • @Bryan one tag which contains an indeterminate number of tags is not just "one tag". HTML is not a regular language, and thus cannot be parsed using a regular expression. – Andrew Marshall Dec 04 '11 at 23:40
  • @All, guys, it is indeed a div in a div, just a depth of two. I just do not want to load a complete library for this that could be buggy. I just wanted to do this small task with a few lines of code, without a third-party tool that could be unstable. – Aqif Hamid Dec 04 '11 at 23:46
  • or any suggestions about 3rd party Parser... – Aqif Hamid Dec 04 '11 at 23:46
  • @AqifHamid What?! A battle-tested, open-source, third-party HTML parser will almost certainly be *far* less buggy than any short, impromptu regular expression for parsing HTML. Perhaps you should read up on [NIH syndrome](http://en.wikipedia.org/wiki/Not_Invented_Here). – Andrew Marshall Dec 04 '11 at 23:50
  • @Andrew, I am making something for public release, and I usually avoid messing with licenses :| – Aqif Hamid Dec 04 '11 at 23:54
  • 2
    "I don't want something buggy, so I'm going to use regex to solve a problem which is **mathematically provably** impossible to solve using regex". That's a.. bit contradictory ;) – Voo Dec 05 '11 at 00:04

2 Answers

4

I think a regex is the wrong tool for this. I would consider using a lexer/parser library, or just using a 3rd party HTML parsing library. A quick google shows several out there.

yshavit
3

Regular expressions are not suitable for HTML parsing, since HTML is not a regular language. You would be better off using a proper HTML parser library, such as jsoup or JTidy.

See also this question for more Java HTML parser references.

thkala
  • While I agree with your answer, HTML parsers can be built around regexes. http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491 – FailedDev Dec 05 '11 at 00:01
  • @FailedDev No you can't. You may be capable of doing it with backreferences (heck the answer you're linking is using a loop in there), but then that's no longer a regular expression which proves the point quite nicely.. It's mathematically impossible to parse HTML/XML with a regular expression correctly. – Voo Dec 05 '11 at 00:05
  • @Voo Well, I wasn't aware that we are talking about ancient regex engines without backreference support :) - which of course makes the regex able to parse irregular things :) – FailedDev Dec 05 '11 at 00:06
  • @FailedDev: 1. using regular expressions backed by additional logic may be fine. In your link, that big loop and switch statement are not quite part of the regular expression :-) 2. just because you can do something, it doesn't mean that you should :-) – thkala Dec 05 '11 at 00:10
  • @thkala 2 - I know :) that's why I agreed + upvoted your answer :) – FailedDev Dec 05 '11 at 00:11
  • @FailedDev There's a nice definition for "regular expression" in CS. Now Perl and co have added backreferences to it, which makes it more powerful (Turing complete? no idea). Then the guy is even using loops in there. If I could bother trying to read Perl regexes I'd like to see if it's possible to make it exponential with some input - always a problem with complex regexes.. Great, I love parsers that are vulnerable to DoS attacks (or can you prove to me that the regex is always linear with respect to the input? Yeah, have fun with that..) – Voo Dec 05 '11 at 00:12
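
As a rough illustration of the "regular expressions backed by additional logic" idea mentioned in the comments above, the sketch below uses a regex only as a tokenizer for `<div ...>` and `</div>` tags and keeps a depth counter in plain Java to find the matching close. The class name `DivExtractor`, the method `extractDiv`, and the opening-tag patterns passed to it are hypothetical names chosen for this example, not part of any library. It assumes well-formed input, no `<div>` text inside comments, scripts, or attribute values, and no self-closing `<div/>`; a real HTML parser such as jsoup handles those cases and is likely still the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a general HTML parser.
final class DivExtractor {

    // Matches an opening <div ...> tag (group 1 is set) or a closing </div> tag.
    private static final Pattern DIV_TAG =
            Pattern.compile("<(div)\\b[^>]*>|</div\\s*>", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the first div whose opening tag matches openTagRegex,
     * including nested divs, or null if no balanced div is found.
     */
    static String extractDiv(String html, String openTagRegex) {
        Matcher open = Pattern.compile(openTagRegex, Pattern.CASE_INSENSITIVE).matcher(html);
        if (!open.find()) {
            return null; // no opening tag of interest
        }
        int depth = 1; // we are inside the div we just opened
        Matcher tag = DIV_TAG.matcher(html);
        tag.region(open.end(), html.length()); // scan only what follows the opening tag
        while (tag.find()) {
            if (tag.group(1) != null) {
                depth++; // nested opening <div>
            } else {
                depth--; // closing </div>
            }
            if (depth == 0) {
                return html.substring(open.start(), tag.end());
            }
        }
        return null; // unbalanced input: the closing tag was never found
    }

    public static void main(String[] args) {
        String input = "asdasd<div class=\"af\">sasa<div><div><div class=\"afs\">as</div>"
                + "qwessa</div></div></div>asd";
        // Prints: <div class="afs">as</div>
        System.out.println(extractDiv(input, "<div\\b[^>]*class=\"afs\"[^>]*>"));
    }
}
```

The counting happens in the loop, not in the pattern, which is why this works at any nesting depth while a single regular expression does not.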