1

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I have an HTML file that looks like this:

<html>
  <head>
    <title>foobar</title>
  </head>
  <body>
    bla bla<br />
    {[CONTAINER]}
      Hello
    {[/CONTAINER]}
  </body>
</html>

How do I get the "Hello" in the container out of the rest of the HTML file? I did this in PHP years ago, and I remember a regex function that calls a defined class function and passes the content of the container as a parameter.

Can someone tell me how to do this in Java?

Community
Maxiking1011
  • 1
    @user2029057: Can you state what assumption that we can make about your text? – nhahtdh Jan 31 '13 at 14:38
  • 2
    There are many ways that HTML will trip up attempts at using RegEx. The canonical post is [a well known StackOverflow post](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), for example handling tag attributes. @Nikita's answer covers it pretty well. The OP is free to use a regex but it would be wise to be careful of the many edge cases. – Kelly S. French Jan 31 '13 at 14:59

3 Answers

4

You can use a regex that matches everything between {[CONTAINER]} and {[/CONTAINER]}. Example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Lookbehind for the opening tag. It is zero-width, so the tag itself is not part of the match.
String open = "(?<=\\{\\[CONTAINER\\]\\})";

// Content between the opening and closing tags (reluctant, so it stops at the first closing tag).
String inside = ".*?";

// Lookahead for the closing tag. It is also zero-width.
String close = "(?=\\{\\[/CONTAINER\\]\\})";

// Final regex
String regex = open + inside + close;

String text = "<html>..."; // your string here

// Usage: DOTALL lets '.' match line breaks, because the content spans several lines.
Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
while (matcher.find()) {
    String content = matcher.group().trim();
    System.out.println(content);
}

But you must be careful: this works only for the exact tags {[CONTAINER]} and {[/CONTAINER]}. Attributes on these custom tags aren't supported.

Also be aware that it doesn't treat HTML tags in any special way, so any HTML tags between your CONTAINER tags will be included in the result.
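The question also mentions a PHP-style function that hands each container's content to a callback. A minimal sketch of that idea in Java follows, assuming a capturing group instead of the lookarounds above and `Pattern.quote` instead of manual escaping. The names `ContainerExtractor` and `forEachContainer` are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class ContainerExtractor {
    // Pattern.quote treats the markers as literals, so { [ ] } need no manual escaping.
    // Group 1 captures the content; DOTALL lets '.' match line breaks.
    private static final Pattern CONTAINER = Pattern.compile(
            Pattern.quote("{[CONTAINER]}") + "(.*?)" + Pattern.quote("{[/CONTAINER]}"),
            Pattern.DOTALL);

    // Calls the handler once per container, passing the trimmed content.
    static void forEachContainer(String text, Consumer<String> handler) {
        Matcher matcher = CONTAINER.matcher(text);
        while (matcher.find()) {
            handler.accept(matcher.group(1).trim());
        }
    }

    public static void main(String[] args) {
        String html = "<body>bla bla<br />\n{[CONTAINER]}\n  Hello\n{[/CONTAINER]}\n</body>";
        List<String> found = new ArrayList<>();
        forEachContainer(html, found::add);
        System.out.println(found); // prints [Hello]
    }
}
```

Any method reference or lambda that accepts a `String` can serve as the handler, which is roughly the role the callback played in the PHP version. The same limitations apply: no attributes on the tags, and nested containers are not handled.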

Mikita Belahlazau
1

You can parse the HTML using jsoup, more help here

More detailed here

TheWhiteRabbit
0

Why do you want to use Java? You can simply use the DOM API with JavaScript:

document.getElementById("id_container").firstChild.data; // beware of \n char

or in a less efficient way:

document.getElementById("id_container").innerHTML;

However, if your file is built on the server, you can use the same API from Java:

http://docs.oracle.com/javase/6/docs/api/org/w3c/dom/package-summary.html
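As a rough sketch of the server-side route, the JDK's `org.w3c.dom` API can read the body text, after which the markers are located with plain string search. This assumes the file is well-formed XHTML like the sample in the question (ordinary HTML with unclosed tags would fail to parse), and the class name `DomContainerReader` and method `firstContainer` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class DomContainerReader {
    static final String OPEN = "{[CONTAINER]}";
    static final String CLOSE = "{[/CONTAINER]}";

    // Returns the trimmed text of the first container inside <body>, or null if there is none.
    // Assumption: the input is well-formed XML; otherwise parsing fails.
    static String firstContainer(String xhtml) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        NodeList bodies = doc.getElementsByTagName("body");
        if (bodies.getLength() == 0) {
            return null;
        }
        // getTextContent concatenates all text nodes below <body>, dropping the markup.
        String bodyText = bodies.item(0).getTextContent();
        int start = bodyText.indexOf(OPEN);
        if (start < 0) {
            return null;
        }
        int end = bodyText.indexOf(CLOSE, start + OPEN.length());
        if (end < 0) {
            return null;
        }
        return bodyText.substring(start + OPEN.length(), end).trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>foobar</title></head><body>bla bla<br />\n"
                + "{[CONTAINER]}\n  Hello\n{[/CONTAINER]}\n</body></html>";
        System.out.println(firstContainer(page)); // prints Hello
    }
}
```

Because `getTextContent` strips markup, any HTML tags inside the container are dropped and only their text is returned, which differs from the regex approach.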

xdevel2000
  • He's not asking to parse HTML, he's asking to obtain some text from between 2 very specific tags. – Cerbrus Jan 31 '13 at 14:42