I am trying to parse the HTML below using jsoup, but I can't get the syntax right.

<div class="info"><strong>Line 1:</strong> some text 1<br>
  <b>some text 2</b><br>
  <strong>Line 3:</strong> some text 3<br>
</div>

I need to capture some text 1, some text 2 and some text 3 in three different variables.

I have the XPath for the first line (line 3 should be similar), but I can't work out the equivalent CSS selector.

//div[@class='info']/strong[1]/following::text()

On a separate note, I have a few hundred HTML files and need to parse them and extract data to store in a database. Is Jsoup the best choice for this?

TylerH
PTS Admin

3 Answers

It really looks like Jsoup can't handle getting text out of an element with mixed content. Here is a solution that applies the XPath you formulated, using XOM and TagSoup:

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import nu.xom.XPathContext;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final String html = "<div class=\"info\"><strong>Line 1:</strong> some text 1<br><b>some text 2</b><br><strong>Line 3:</strong> some text 3<br></div>";
        final Parser parser = new Parser();
        final Builder builder = new Builder(parser);
        final Document document = builder.build(html, null);
        final nu.xom.Element root = document.getRootElement();
        final Nodes textElements = root.query("//xhtml:div[@class='info']/xhtml:strong[1]/following::text()", new XPathContext("xhtml", root.getNamespaceURI()));
        for (int textNumber = 0; textNumber < textElements.size(); ++textNumber) {
            System.out.println(textElements.get(textNumber).toXML());
        }
    }
}

This outputs:

 some text 1
some text 2
Line 3:
 some text 3

Without knowing more specifics of what you're trying to do though, I'm not sure if this is exactly what you want.

laz
  • I changed my answer to try using your XPath with XOM using TagSoup. – laz Aug 06 '12 at 01:07
  • Thanks for the code, I will give TagSoup a shot. Is TagSoup better than Jsoup? I am pretty new to parsing and am just starting to code in Java again after 7 years, so consider me a newbie :). I am simply trying to parse a set of HTML files stored on my machine to extract useful data and store it in a database. The only restriction is that I don't want to make JS or image calls from the HTML, as those links don't exist and may slow down the process a lot. – PTS Admin Aug 06 '12 at 10:32
  • I just tried running the code and the output is empty. When I printed textElements.size() it was 0. Any idea? – PTS Admin Aug 06 '12 at 11:00
  • Hmm, what versions of XOM and TagSoup are you using? – laz Aug 07 '12 at 03:45
  • Both latest - xom-1.2.8 and tagsoup-1.2.1 – PTS Admin Aug 07 '12 at 12:11
  • Looks like a difference between XOM 1.1 and 1.2.x. I was using 1.1. I'm not yet sure what the difference is though. – laz Aug 14 '12 at 15:16
  • It is a namespace issue. It seems that with the 1.2.x line of XOM invoking `setFeature(Parser.namespacesFeature, false)` has no effect for some reason. It always adds a namespace. I'll update my answer with the working code momentarily. – laz Aug 14 '12 at 15:38

It is possible to get an object reference to individual TextNodes. I think you may have overlooked Jsoup's TextNode object.

The text at the top level of an Element is an instance of a TextNode object. For instance, " some text 1" and " some text 3" are both TextNode objects under <div class="info">, and "Line 1:" is a TextNode object under <strong>.

Element objects have a textNodes() method, which you can use to get hold of these TextNode objects.

Check the following code:

import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

String html = "<html>" +
                  "<body>" +
                      "<div class=\"info\">" +
                          "<strong>Line 1:</strong> some text 1<br>" +
                          "<b>some text 2</b><br>" +
                          "<strong>Line 3:</strong> some text 3<br>" +
                      "</div>" +
                  "</body>" +
              "</html>";

Document document = Jsoup.parse(html);
Element infoDiv = document.select("div.info").first();
List<TextNode> infoDivTextNodes = infoDiv.textNodes();

This code finds the first <div> element whose class attribute is "info", then gets references to all of the TextNode objects directly under it. That list looks like:

List<TextNode>[" some text 1", " some text 3"]

TextNode objects carry useful data and methods of their own, and TextNode extends Node, which gives you even more functionality to work with.
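Since textNodes() alone leaves out "some text 2" (it sits inside <b>), one way to capture all three values is to walk the div's child nodes in document order. The following is only a minimal sketch under that assumption; the class name InfoTextExtractor and the method extract are made up for illustration, and it assumes the markup keeps the shape shown in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper, not part of jsoup.
public class InfoTextExtractor {

    /** Returns the loose text and the <b> text directly under div.info, in document order. */
    public static List<String> extract(String html) {
        Document document = Jsoup.parse(html);
        Element infoDiv = document.select("div.info").first();
        List<String> values = new ArrayList<>();
        if (infoDiv == null) {
            return values; // no div.info in this document
        }
        for (Node child : infoDiv.childNodes()) {
            if (child instanceof TextNode) {
                TextNode textNode = (TextNode) child;
                // Skip whitespace-only nodes produced by source indentation.
                if (!textNode.isBlank()) {
                    values.add(textNode.text().trim());
                }
            } else if (child instanceof Element && "b".equals(((Element) child).tagName())) {
                // "some text 2" is wrapped in <b>, so read it from the element.
                values.add(((Element) child).text());
            }
            // <strong> labels and <br> elements are ignored on purpose.
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<div class=\"info\"><strong>Line 1:</strong> some text 1<br>\n"
                + "  <b>some text 2</b><br>\n"
                + "  <strong>Line 3:</strong> some text 3<br>\n"
                + "</div>";
        List<String> values = extract(html);
        String text1 = values.get(0);
        String text2 = values.get(1);
        String text3 = values.get(2);
        System.out.println(text1 + " | " + text2 + " | " + text3);
    }
}
```

Indexing into the list by position only works while every file has exactly this layout, so real input would probably need a size check before values.get(...).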

The following is an example of getting object references for each TextNode inside divs with class="info".

for(Iterator<Element> elementIt = document.select("div.info").iterator(); elementIt.hasNext();){
    Element element = elementIt.next();

    for (Iterator<TextNode> textIt = element.textNodes().iterator(); textIt.hasNext();) {
        TextNode textNode = textIt.next();
        //Do your magic with textNode now.
        //You can even reference its parent via the inherited Node method .parent()
    }
}

Using this nested iterator technique you can access all the text nodes of an object, and with some clever logic you can do just about anything you want within Jsoup's structure.
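As one possible example of such logic, the text that follows each <strong> label can be reached through Node.nextSibling(), which is roughly what the question's XPath was aiming at. This is a sketch rather than a definitive approach; LabelledTextExtractor and extract are hypothetical names, and it assumes each label is immediately followed by its text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper, not part of jsoup.
public class LabelledTextExtractor {

    /** Maps each <strong> label under div.info to the text node that directly follows it. */
    public static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Element label : Jsoup.parse(html).select("div.info > strong")) {
            Node next = label.nextSibling();
            // Only pair the label with a value when the next node really is text.
            if (next instanceof TextNode) {
                result.put(label.text().trim(), ((TextNode) next).text().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class=\"info\"><strong>Line 1:</strong> some text 1<br>"
                + "<b>some text 2</b><br>"
                + "<strong>Line 3:</strong> some text 3<br></div>";
        System.out.println(extract(html));
    }
}
```

Keying by label avoids depending on position, but the unlabelled "some text 2" still has to be read separately, for instance with select("div.info b").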

I implemented this logic for a spell-checking method in the past, and it does take a performance hit on very large HTML documents with a high number of elements (many long lists, for example). But if your files are of reasonable length, performance should be sufficient.

The following is an example of getting object references for each TextNode of a Document.

Document document = Jsoup.parse(html);

for (Iterator<Element> elementIt = document.body().getAllElements().iterator(); elementIt.hasNext();) {
    Element element = elementIt.next();
    //Maybe some magic for each element..

    for (Iterator<TextNode> textIt = element.textNodes().iterator(); textIt.hasNext();) {
        TextNode textNode = textIt.next();
        //Lots of magic here for each textNode..
    }
}
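
For the batch side of the question (a few hundred HTML files on local disk), the files first have to be collected before any of the loops above can run on them. Below is a rough sketch using only the JDK; HtmlFileLister and listHtmlFiles are hypothetical names, and the ".html"/".htm" extension filter is an assumption about how the files are named.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for gathering the input files.
public class HtmlFileLister {

    /** Recursively lists regular files under root whose names end in .html or .htm. */
    public static List<Path> listHtmlFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(path -> {
                        String name = path.getFileName().toString().toLowerCase();
                        return name.endsWith(".html") || name.endsWith(".htm");
                    })
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // The directory to scan is passed as the first argument.
        for (Path file : listHtmlFiles(Paths.get(args[0]))) {
            // Each file could then be handed to Jsoup.parse(file.toFile(), "UTF-8").
            System.out.println(file);
        }
    }
}
```

Jsoup.parse(File, String) only parses the markup it is given; it does not download scripts or images referenced by the page, so missing links should not slow the run down.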
mcdonasm

I think your problem is that, of the text you're interested in, only one phrase is enclosed in tags of its own: "some text 2", which is enclosed by <b> </b> tags. So this is easily obtainable via:

String text2 = doc.select("div.info b").text();

which returns

some text 2

The other texts of interest can only be defined as text held within your <div class="info"> tag, and that's it. So the only way that I know of to get this is to get all the text held by this larger element:

String text1 = doc.select("div.info").text();

But unfortunately, this gets all the text held by this element:

Line 1: some text 1 some text 2 Line 3: some text 3

That's about the best I can do, and I'm hoping someone can find a better answer and will keep following this question.

Hovercraft Full Of Eels