2

I recently noticed inconsistent Jsoup behavior when it comes to tbody tags. When I parse a remote web page with an HTML structure like:

<table>
   <tbody>
     <tr><td>... text
   </tbody>
</table>

Jsoup does not include the tbody element in the elements returned by the select() method.

I use connect().get() to load the remote page into a Document variable, like this:

Document doc = Jsoup.connect(url).get();
String expr = "table>tr>td";
String parsedTxt = doc.select(expr).text();

But when I parse the same page from my local disk (after downloading it), Jsoup includes the tbody tag. My expression no longer works, because it does not account for the tbody element.

I use:

File input = new File(locationOfFile);
Document doc = Jsoup.parse(input, "UTF-8", "");

My Jsoup expression works only in the first case.

Is there a way to force Jsoup to recognize the tbody element (or to remove it) so the same expression can be used in both cases?

Is this normal behavior for Jsoup?

Should I be using the connect method in parsing the local page as well?

Chris Martin
Alain

3 Answers

1

It sounds like the browser you used to save the file included/created tbody tags when it saved the file. Which browser did you use to save the file to your desktop?

I would try downloading the file manually using curl or wget and then parsing it from the file.

Femi
  • I am using Mozilla Firefox on a Windows machine. I also forgot to mention that I am using the Firebug 1.7.2 plugin to view the HTML structure of the page. I just tried downloading the page with wget. When I open the downloaded page, the tbody tag is present. – Alain Jun 20 '11 at 09:54
  • Instead of the expression "table>tr>td", would "table tr>td" work? In other words, if I fetched the tr nodes descending from the table element instead of only its direct children, maybe it could work in both cases? – Alain Jun 20 '11 at 09:57
  • You could also use the `table>tr>td, table>tbody>tr>td` selector to make sure you catch both cases. – Stefan Jan 22 '13 at 07:31
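
For illustration, here is a minimal sketch of the selector group suggested in the last comment. The class name `TbodySelectorSketch`, the helper `cellText`, and the two sample HTML strings are made up for this example; only `Jsoup.parse`, `select` and `text` come from Jsoup. The point is that the grouped selector finds the cell whether or not a tbody sits between table and tr, so it should not matter which form the parsed document ends up in.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TbodySelectorSketch {
    // Selector group: the first alternative covers tables without a tbody,
    // the second covers tables where a tbody is present.
    static final String CELLS = "table > tr > td, table > tbody > tr > td";

    // Hypothetical helper: parse an HTML string and return the text of the matched cells.
    static String cellText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select(CELLS).text();
    }

    public static void main(String[] args) {
        // Sample inputs invented for this sketch.
        String withTbody = "<table><tbody><tr><td>text</td></tr></tbody></table>";
        String withoutTbody = "<table><tr><td>text</td></tr></table>";

        System.out.println(cellText(withTbody));
        System.out.println(cellText(withoutTbody));
    }
}
```

The descendant form `table tr > td` from the earlier comment should behave the same way for a simple table like this, but it would also pick up rows of nested tables, which the grouped selector avoids.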
0

Instead of inspecting the element with Firebug, try searching for tbody in the page source (Show page source). You should try to print/inspect

Document.html() 

and see whether Jsoup actually got the whole HTML. If it did, the next step would be to report it on the Jsoup issue tracker: https://github.com/jhy/jsoup/issues

If it did not (which is most likely), you should try adding headers to your GET request (such as a user agent and cookies). AJAX could also be the problem, in which case you should use Selenium: http://seleniumhq.org
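
A rough sketch of what that could look like, assuming a recent Jsoup version. The class name `FetchSketch`, the helper `browserLike`, the user-agent string, the cookie name and value, and the fallback URL are all placeholders, not values known to work for the page in the question.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchSketch {
    // Hypothetical helper: build a request that looks more like a browser.
    static Connection browserLike(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")          // placeholder user agent
                .cookie("session", "placeholder")  // hypothetical cookie name and value
                .timeout(10_000);                  // 10 seconds, in milliseconds
    }

    public static void main(String[] args) throws IOException {
        // Pass the real page URL as the first argument; the fallback is only a placeholder.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        Document doc = browserLike(url).get();

        // Print what Jsoup actually received, to compare against the browser's page source.
        System.out.println(doc.html());
    }
}
```

If the printed HTML differs from what the browser shows in its page source, the server is probably responding differently to the two clients, or the content is being added by JavaScript after loading.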

Manik
0

You can try Jsoup 1.7.3. It works for your situation. Sample code follows.

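    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Same structure as in the question, with an explicit tbody.
    String html
            = "<table>\n"
            + "<tbody>\n"
            + "<tr><td>... text.\n"
            + "</tbody>\n"
            + "</table>";
    Document doc = Jsoup.parse(html);
    // Select the cells by going through tbody.
    Elements eles = doc.select("tbody > tr > td");

    // Print the outer HTML of each matched cell.
    for (Element ele : eles) {
        System.out.println(ele.toString());
    }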

The result is this:

    <td>... text. </td>
Shaowei Ling