5

I have a serious problem. I would like to extract the text content from nested tags such as:

<div class="main-content">
    <div class="sub-content">Sub content here</div>
      Main content here </div>

The output I would expect is:

Sub content here
Main content here

I've tried using a regex, but the result isn't very good. Using:

Pattern.compile("<div>(\\S+)</div>");

returns all the strings before the first </div> tag.
So, could anyone help me, please?

kyo21
  • 1
    Do not use regular expressions for HTML parsing. Use an HTML parser; refer to this question: http://stackoverflow.com/questions/238036/java-html-parsing – rkg May 17 '11 at 05:35
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 :) – Tomas F Oct 28 '15 at 14:30

2 Answers

8

I'd recommend avoiding regex for parsing HTML. You can easily do what you ask by using Jsoup:

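import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DivText {
    public static void main(String[] args) {
        String html = "<html><head/><body><div class=\"main-content\">" +
                "<div class=\"sub-content\">Sub content here</div>" +
                "Main content here </div></body></html>";
        Document document = Jsoup.parse(html);
        // Select every div, outer and nested, in document order.
        Elements divs = document.select("div");
        for (Element div : divs) {
            // ownText() returns only the element's own text, not its children's.
            System.out.println(div.ownText());
        }
    }
}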
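
As an aside on why the pattern from the question is a poor fit, here is a minimal sketch using only `java.util.regex`. The class name `RegexDivDemo` and the helper `firstMatch` are made up for illustration. It suggests two likely problems: the literal `<div>` cannot match an opening tag that carries attributes, and `\S+` stops at the first whitespace character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDivDemo {
    // The pattern from the question: a bare <div> tag around a run of non-whitespace characters.
    static final Pattern QUESTION_PATTERN = Pattern.compile("<div>(\\S+)</div>");

    /** Returns the first captured group, or null when the pattern finds nothing. */
    public static String firstMatch(String html) {
        Matcher m = QUESTION_PATTERN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"main-content\">"
                + "<div class=\"sub-content\">Sub content here</div>"
                + " Main content here </div>";
        // No match: both opening tags carry a class attribute.
        System.out.println(firstMatch(html));
        // No match either: the content contains spaces, which \S+ cannot cross.
        System.out.println(firstMatch("<div>two words</div>"));
        // Matches only an attribute-free div whose content has no whitespace.
        System.out.println(firstMatch("<div>word</div>"));
    }
}
```

Loosening the pattern helps with flat markup, but nested `div` tags still defeat it, which is why a parser such as Jsoup is the safer route here.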

In response to comment: if you want to put the content of the div elements into an array of Strings you can simply do:

    String[] divsTexts = new String[divs.size()];
    for (int i = 0; i < divs.size(); i++) {
        divsTexts[i] = divs.get(i).ownText();
    }

In response to comment: if you have nested elements and you want to get the own text of each element, then you can use jQuery's multiple-selector syntax. Here's an example:

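import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NestedText {
    public static void main(String[] args) {
        String html = "<html><head/><body><div class=\"main-content\">" +
                "<div class=\"sub-content\">" +
                "<p>a paragraph <b>with some bold text</b></p>" +
                "Sub content here</div>" +
                "Main content here </div></body></html>";
        Document document = Jsoup.parse(html);
        // A comma-separated selector matches any of the listed tags.
        Elements elements = document.select("div, p, b");
        for (Element element : elements) {
            System.out.println(element.ownText());
        }
    }
}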

The code above will parse the following HTML:

<html>
<head />
<body>
<div class="main-content">
<div class="sub-content">
<p>a paragraph <b>with some bold text</b></p>
Sub content here</div>
Main content here</div>
</body>
</html>

and print the following output:

Main content here
Sub content here
a paragraph
with some bold text
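
The difference between `text()` and `ownText()`, which the comments on this answer keep coming back to, can be sketched roughly as follows. This is only an illustration: the class name `TextVsOwnText` and its two helper methods are invented here, and the exact whitespace in the `text()` result may vary between jsoup versions.

```java
import org.jsoup.Jsoup;

public class TextVsOwnText {
    /** Combined text of the first matching element and all of its descendants. */
    public static String allText(String html, String css) {
        return Jsoup.parse(html).select(css).first().text();
    }

    /** Only the text that belongs directly to the first matching element. */
    public static String ownText(String html, String css) {
        return Jsoup.parse(html).select(css).first().ownText();
    }

    public static void main(String[] args) {
        String html = "<div class=\"main-content\">"
                + "<div class=\"sub-content\">Sub content here</div>"
                + "Main content here </div>";
        // text() on the outer div also includes the inner div's text,
        // so looping over every div with text() repeats the inner text.
        System.out.println(allText(html, "div.main-content"));
        // ownText() keeps each element's text separate.
        System.out.println(ownText(html, "div.main-content"));
        System.out.println(ownText(html, "div.sub-content"));
    }
}
```
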
MarcoS
  • Err... what if I would like to add each `div`'s content into an array? Any suggestions? Thanks.
    – kyo21 May 18 '11 at 09:05
  • @kyo21: I added some code to my answer to answer your question on having the `div` contents into an array. – MarcoS May 18 '11 at 09:13
  • Oh, sorry, I need your explanation again. I use the `element.text()` method to get all the text inside a tag. I've added a tag inside the div content, but the result is: - Sub content here Main content here - Sub content here. How could this happen?
    – kyo21 May 18 '11 at 14:54
  • @kyo21: `text()` gets the combined text of this element and all its children. See [jsoup javadocs](http://jsoup.org/apidocs/index.html?org/jsoup/nodes/Element.html) – MarcoS May 18 '11 at 14:58
  • @kyo21: I'm not quite sure what HTML you have now, but note that if you have two nested `div` tags each containing text, and you use the `text()` method, then the text of the innermost `div` tag is printed twice: once when you call `text()` on the outer `div`, and once when you call it on the inner `div` (the `for` loop processes all `div` tags). I hope this helps. – MarcoS May 18 '11 at 15:07
  • I'm working on extracting content from news web pages. Most of them have nested div tags containing other text tags, etc. Using the `text()` method obviously gets all the text content, but it prints the innermost content twice, as you said. How can I prevent this? Do you have any ideas or other methods? The result I would expect: - Main content here - Sub content here. Thanks.
    – kyo21 May 19 '11 at 03:23
  • @kyo21: well, why don't you use `ownText()` as in my example? That returns only the text of the element, and not that of its nested elements. So, you can select the elements that you're interested in, process them one by one, and call `ownText()` to retrieve their own text (if any). – MarcoS May 19 '11 at 06:07
  • Yeah, that's true, but `ownText()` will not return the content inside the nested tags. Err... or should I just remove those tags so they don't bother me anymore? Anyway, thanks for your help... :)
    – kyo21 May 19 '11 at 07:04
  • @kyo21: oh that's easy: Jsoup supports CSS/jquery selector syntax, so you can write `document.select("div, p, b")` :) I've edited my answer to address your comment. I hope this helps. – MarcoS May 19 '11 at 07:21
  • @MarcoS out of curiosity, does jsoup have any method to remove a particular tag? For example, I would like to remove a tag inside my HTML. I guess it will help me a lot in the future.
    – kyo21 May 20 '11 at 04:50
  • @kyo21: as far as I know, yes: have a look at [remove](http://jsoup.org/apidocs/org/jsoup/nodes/Node.html#remove%28%29) in the jsoup javadoc – MarcoS May 20 '11 at 09:56
  • Ah, there is... It says this method also removes all of the node's child nodes, right? If I want to remove only the tag around some text, it's impossible for me to retain the "text", isn't it?
    – kyo21 May 20 '11 at 14:16
  • @kyo21: I don't remember: try ... and if not, browse the javadoc to look at what other methods do :) – MarcoS May 20 '11 at 14:33
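
For the last question in this thread, one possible approach is sketched below, assuming a jsoup release that provides `unwrap()` (it may not have existed when these comments were written). The class `RemoveVsUnwrap`, its two helpers, and the `<b>` sample markup are hypothetical, since the tag in the comment above was lost. `remove()` deletes the element together with its children, while `unwrap()` drops only the element and leaves its children in place.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RemoveVsUnwrap {
    /** Deletes every element matching the selector, including its child nodes. */
    public static String removeTag(String html, String css) {
        Document doc = Jsoup.parse(html);
        doc.select(css).remove();
        return doc.body().html();
    }

    /** Drops only the matching elements and keeps their child nodes where they were. */
    public static String unwrapTag(String html, String css) {
        Document doc = Jsoup.parse(html);
        doc.select(css).unwrap();
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup; <b> stands in for the tag to strip.
        String html = "<p>some <b>text</b> here</p>";
        System.out.println(removeTag(html, "b")); // the word "text" is gone
        System.out.println(unwrapTag(html, "b")); // the word "text" stays, the <b> tag does not
    }
}
```
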
2
<div class="main-content" id="mainCon">
    <div class="sub-content" id="subCon">Sub content here</div>
 Main content here </div>

To get the result you mentioned from this markup:

Use document.getElementById("mainCon").innerHTML. It returns "Main content here" together with the markup of the sub div, so you still have to parse that part out.

Similarly, for the sub div you can use the same snippet, i.e. document.getElementById("subCon").innerHTML

Ankit
  • @kyo21: Yes, you give each div an id manually, and you can also assign it dynamically with JavaScript. – Ankit May 17 '11 at 05:57