Get a part of an HTML file in Java

Question

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I have a HTML file looking like this:

<html>
  <head>
    <title>foobar</title>
  </head>
  <body>
    bla bla<br />
    {[CONTAINER]}
      Hello
    {[/CONTAINER]}
  </body>
</html>

How do I get the "Hello" inside the container out of the rest of the HTML file? I did this in PHP years ago, and I remember a regex function that calls a defined class method and passes the content of the container as a parameter.

Can someone tell me how to do this in Java?

@user2029057: Can you state what assumptions we can make about your text? – nhahtdh Jan 31 '13 at 14:38
There are many ways that HTML will trip up attempts at using regex, for example handling tag attributes. The canonical reference is a well-known Stack Overflow post (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). @Nikita's answer covers it pretty well. The OP is free to use a regex, but it would be wise to be careful of the many edge cases. – Kelly S. French Jan 31 '13 at 14:59

Mikita Belahlazau · Accepted Answer · 2013-01-31T15:13:34.180

4

You can use a regex that matches everything between {[CONTAINER]} and {[/CONTAINER]}. Example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Lookbehind for the open tag: it must precede the match but is not part of the matched text.
String open = "(?<=\\{\\[CONTAINER\\]\\})";

// Content between the open and close tags (reluctant, so it stops at the first close tag).
String inside = ".*?";

// Lookahead for the close tag: it must follow the match but is not part of the matched text.
String close = "(?=\\{\\[/CONTAINER\\]\\})";

// Final regex
String regex = open + inside + close;

String text = "<html>..."; // your string here

// Usage. DOTALL lets '.' match line breaks, so multi-line content is found.
Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
while (matcher.find()) {
    String content = matcher.group().trim();
    System.out.println(content);
}

But you must be careful: it works only for the exact markers {[CONTAINER]} and {[/CONTAINER]}. Attributes on these custom tags aren't supported.

You should also be aware that it doesn't handle HTML tags in any specific way. If there are HTML tags between your CONTAINER tags, they will be included in the result.
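
As a variation on the same idea, here is a self-contained sketch that uses a capturing group instead of lookarounds and wraps the logic in methods. The class and method names (ContainerExtractor, extract, replaceEach) are made up for illustration and do not come from the answer. The replaceEach method approximates the callback style the question remembers from PHP; it relies on Matcher.replaceAll(Function), so it assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class ContainerExtractor {
    // Group 1 captures everything between the two markers.
    // DOTALL lets '.' match line breaks as well.
    private static final Pattern CONTAINER = Pattern.compile(
            "\\{\\[CONTAINER\\]\\}(.*?)\\{\\[/CONTAINER\\]\\}", Pattern.DOTALL);

    /** Returns the trimmed content of every container, in document order. */
    static List<String> extract(String text) {
        List<String> contents = new ArrayList<>();
        Matcher matcher = CONTAINER.matcher(text);
        while (matcher.find()) {
            contents.add(matcher.group(1).trim());
        }
        return contents;
    }

    /** Replaces each container, markers included, with the handler's return value. */
    static String replaceEach(String text, Function<String, String> handler) {
        // quoteReplacement keeps '$' and '\' in the handler's result from being
        // interpreted as group references.
        return CONTAINER.matcher(text).replaceAll(
                match -> Matcher.quoteReplacement(handler.apply(match.group(1).trim())));
    }

    public static void main(String[] args) {
        String html = "<body>\n  bla bla<br />\n  {[CONTAINER]}\n    Hello\n  {[/CONTAINER]}\n</body>";
        System.out.println(extract(html));
        System.out.println(replaceEach(html, String::toUpperCase));
    }
}
```

The same caveats apply: the markers must match exactly, and any HTML between them is returned untouched.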

edited Jan 31 '13 at 15:13

answered Jan 31 '13 at 14:39

Mikita Belahlazau


1

+1 for actually answering the question instead of jumping on the _"don't parse html with regexes"_ bandwagon. – Cerbrus Jan 31 '13 at 14:44
Another thing is that it doesn't care whether there is any HTML markup between them, if any. – nhahtdh Jan 31 '13 at 14:45
Thanks to everyone who wrote! That was it :) – Maxiking1011 Jan 31 '13 at 15:01
You only have to write every \ twice, then it works! – Maxiking1011 Jan 31 '13 at 15:05
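
On the doubled backslashes: inside a Java string literal, \\ produces the single backslash that the regex engine needs. One way to avoid escaping the markers by hand is Pattern.quote, which makes a string match literally. The following is only a sketch; the class and method names (QuotedMarkers, between) are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class QuotedMarkers {
    /** Returns the trimmed text between the first pair of markers, or null if none is found. */
    static String between(String text, String openMarker, String closeMarker) {
        // Pattern.quote turns each marker into a literal pattern, so the braces
        // and brackets need no manual escaping.
        String regex = Pattern.quote(openMarker) + "(.*?)" + Pattern.quote(closeMarker);
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String text = "bla bla<br />\n{[CONTAINER]}\n  Hello\n{[/CONTAINER]}";
        System.out.println(between(text, "{[CONTAINER]}", "{[/CONTAINER]}"));
    }
}
```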

TheWhiteRabbit · Answer 2 · 2013-01-31T14:44:41.747

1

You can parse the HTML using jsoup; more help here

More detailed here

edited Jan 31 '13 at 14:44

answered Jan 31 '13 at 14:24

TheWhiteRabbit


He's not asking to parse HTML, he's asking to obtain some text from between 2 very specific tags. – Cerbrus Jan 31 '13 at 14:41
Of course; updated with a more detailed link. – TheWhiteRabbit Jan 31 '13 at 14:44
You're still talking about HTML parsing, there. – Cerbrus Jan 31 '13 at 14:45

score 0 · Answer 3 · answered Jan 31 '13 at 14:34

0

Why do you want to use Java? You can simply use the DOM API with JavaScript:

document.getElementById("id_container").firstChild.data; // beware of \n char

or in a less efficient way:

document.getElementById("id_container").innerHTML;

However, if your file is built on the server, you can also use the same API from Java:

http://docs.oracle.com/javase/6/docs/api/org/w3c/dom/package-summary.html
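
A rough sketch of what that could look like with the JDK's built-in DOM and XPath classes follows. It rests on two assumptions that the question does not guarantee: the content sits in a real element such as <div id="id_container"> (not in {[CONTAINER]} markers), and the page is well-formed XML. The class and method names (DomContainerReader, textOfElementWithId) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for illustration only.
class DomContainerReader {
    /** Returns the trimmed text of the first element whose id attribute equals the given id. */
    static String textOfElementWithId(String xhtml, String id) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xhtml)));
            // Document.getElementById only works when a DTD or schema declares the
            // attribute as an ID, so an XPath attribute test is used instead.
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Pass the id as an XPath variable rather than concatenating it into the expression.
            xpath.setXPathVariableResolver(name -> id);
            return xpath.evaluate("string(//*[@id=$id])", document).trim();
        } catch (Exception e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalStateException("Could not parse the document as XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body>bla bla<br /><div id=\"id_container\">\n  Hello\n</div></body></html>";
        System.out.println(textOfElementWithId(xhtml, "id_container"));
    }
}
```

Real-world HTML is often not well-formed XML, in which case this parser throws and a lenient HTML parser or the regex approach is the more practical choice.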

answered Jan 31 '13 at 14:34

xdevel2000


He's not asking to parse HTML, he's asking to obtain some text from between 2 very specific tags. – Cerbrus Jan 31 '13 at 14:42
