It is possible to get an object reference to individual TextNodes. I think you may have overlooked Jsoup's TextNode class.
Each run of text directly under an Element is a TextNode object. For instance, " some text 1" and " some text 3" are both TextNode objects under <div class='info'>, and "Line 1:" is a TextNode object under <strong>.
Element objects have a textNodes() method, which returns the TextNode objects that are direct children of that element.
Check the following code:
String html = "<html>" +
        "<body>" +
        "<div class='info'>" +
        "<strong>Line 1:</strong> some text 1<br>" +
        "<b>some text 2</b><br>" +
        "<strong>Line 3:</strong> some text 3<br>" +
        "</div>" +
        "</body>" +
        "</html>";
Document document = Jsoup.parse(html);
Element infoDiv = document.select("div.info").first();
List<TextNode> infoDivTextNodes = infoDiv.textNodes();
This code finds the first <div> Element that has an attribute with key="class" and value="info". It then gets a reference to all of the TextNode objects directly under <div class='info'>. That list looks like:
List<TextNode>[" some text 1", " some text 3"]
TextNode objects come with useful data and methods of their own, and because TextNode extends Node you get even more functionality on top of that.
The following is an example of getting object references for each TextNode inside divs with class="info".
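To make that a bit more concrete, here is a minimal sketch of a few of those methods in use: isBlank() and text() from TextNode, plus parent() and previousSibling() inherited from Node. The class name TextNodeInfo and the helper describe() are made up for illustration, and the "parent | previous sibling | text" output format is just an assumption to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class TextNodeInfo {

    // Hypothetical helper: describes each direct TextNode of the first div.info
    // as "parent | previous sibling | trimmed text".
    static List<String> describe(String html) {
        Document document = Jsoup.parse(html);
        Element infoDiv = document.select("div.info").first();
        List<String> lines = new ArrayList<>();
        if (infoDiv == null) {
            return lines; // no matching element
        }
        for (TextNode textNode : infoDiv.textNodes()) {
            if (textNode.isBlank()) {
                continue; // skip whitespace-only nodes
            }
            Node parent = textNode.parent();            // inherited from Node
            Node previous = textNode.previousSibling(); // e.g. the <strong> label
            lines.add(parent.nodeName() + " | "
                    + (previous == null ? "-" : previous.nodeName()) + " | "
                    + textNode.text().trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<div class='info'>"
                + "<strong>Line 1:</strong> some text 1<br>"
                + "<b>some text 2</b><br>"
                + "<strong>Line 3:</strong> some text 3<br>"
                + "</div>";
        describe(html).forEach(System.out::println);
    }
}
```

The previousSibling() call is what lets you pair a piece of loose text with the label element that comes right before it.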
for(Iterator<Element> elementIt = document.select("div.info").iterator(); elementIt.hasNext();){
    Element element = elementIt.next();
    for (Iterator<TextNode> textIt = element.textNodes().iterator(); textIt.hasNext();) {
        TextNode textNode = textIt.next();
        //Do your magic with textNode now.
        //You can even reference its parent via the parent() method
        //inherited from Node.
    }
}
Using this nested iterator technique you can access all the text nodes of an element, and with some clever logic you can do just about anything you want within Jsoup's structure.
I used this approach in a spell-checking method I wrote in the past. It does take a performance hit on very large HTML documents with a high number of elements (lots of lists, for example), but if your files are a reasonable length you should get sufficient performance.
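As a rough illustration of that kind of use, here is a minimal sketch that rewrites text nodes in place. It is not the spell checker mentioned above: the class TextNodeRewriter and the correct() method are hypothetical, and correct() is only a stand-in that fixes one hard-coded typo. It relies on TextNode.getWholeText() to read the raw text and TextNode.text(String) to write it back.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class TextNodeRewriter {

    // Hypothetical stand-in for a real spell checker: fixes one hard-coded typo.
    static String correct(String text) {
        return text.replace("teh", "the");
    }

    // Rewrites every TextNode under <body> in place and returns the same Document.
    static Document rewrite(Document document) {
        for (Element element : document.body().getAllElements()) {
            for (TextNode textNode : element.textNodes()) {
                String original = textNode.getWholeText();
                String fixed = correct(original);
                if (!fixed.equals(original)) {
                    textNode.text(fixed); // only touch nodes that changed
                }
            }
        }
        return document;
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse("<p>teh cat <b>teh dog</b></p>");
        System.out.println(rewrite(document).body().text());
    }
}
```

Because only the text nodes are modified, the surrounding markup (the <b> element here) is left untouched.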
The following is an example of getting object references for each TextNode of a Document.
Document document = Jsoup.parse(html);
for (Iterator<Element> elementIt = document.body().getAllElements().iterator(); elementIt.hasNext();) {
    Element element = elementIt.next();
    //Maybe some magic for each element..
    for (Iterator<TextNode> textIt = element.textNodes().iterator(); textIt.hasNext();) {
        TextNode textNode = textIt.next();
        //Lots of magic here for each textNode..
    }
}
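If the nested loops turn out to be too slow on a big document, one alternative worth trying is a single pass over the tree with traverse() and a NodeVisitor. This is only a sketch of that alternative, not what my spell checker used, and I have not benchmarked it; the class name TextNodeCollector and the collect() helper are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class TextNodeCollector {

    // Hypothetical helper: collects every non-blank TextNode under <body>
    // in document order, using one depth-first traversal.
    static List<TextNode> collect(Document document) {
        final List<TextNode> result = new ArrayList<>();
        document.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Called when a node is first reached.
                if (node instanceof TextNode && !((TextNode) node).isBlank()) {
                    result.add((TextNode) node);
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class='info'>"
                + "<strong>Line 1:</strong> some text 1<br>"
                + "<b>some text 2</b><br>"
                + "<strong>Line 3:</strong> some text 3<br>"
                + "</div>";
        for (TextNode textNode : collect(Jsoup.parse(html))) {
            System.out.println(textNode.text());
        }
    }
}
```

Note that the order differs from the nested loops: the traversal yields text nodes in document order, whereas the element-by-element loop yields all of a parent's direct text nodes before moving on to its children.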