When working with a gnarly mess of nested tables, how do we extract a particular one using jsoup?
Take for example the HTML below, a pile of tables. Scan down in the second half for the two key tables, each with a th cell displaying either DOG or CAT.
Sometimes I want the dog-table, sometimes the cat-table. There could be a dozen of these ("BIRD", "MOUSE", "HAMSTER", and so on). The cat-table might be nested deeper than the dog-table. So I cannot use any tricks about "first" or "last". I have to look at the th cell’s value, and then fetch the immediate containing table.
My first attempt was the following jsoup selector:
Elements elements = document.select( "table:has(tbody > tr > th > b:containsOwn(CAT))" );
With that line I get two elements rather than one:
- The table I want.
- The outer table containing the table I want.
At this point my workaround is to examine the length, and go with the shorter one. But there must be a better way.
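One alternative to comparing lengths might be to start from the th cell and climb upward until the first enclosing table is reached. The following is only a sketch under assumptions: the class name NearestTableSketch and the helper findTableByHeader are hypothetical names invented for illustration, not part of jsoup. It also assumes that the header text contains no characters special to the selector syntax, and that no th cell itself wraps a nested table.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NearestTableSketch
{
    // Hypothetical helper (not a jsoup API): returns the innermost table that
    // encloses a th cell whose text contains headerText, or null if none is found.
    // Assumes headerText has no parentheses or other selector syntax in it.
    public static Element findTableByHeader( Document document, String headerText )
    {
        for ( Element th : document.select( "th:contains(" + headerText + ")" ) ) {
            // Climb from the th toward the root. The first table reached
            // is the immediate container, however deeply it is nested.
            for ( Element e = th.parent(); e != null; e = e.parent() ) {
                if ( e.tagName().equals( "table" ) ) {
                    return e;
                }
            }
        }
        return null;
    }

    public static void main( String[] args )
    {
        // A condensed version of the nested tables in this question.
        String html = "<table><tbody><tr>"
                + "<td><table><tbody><tr><th><b>DOG</b></th></tr><tr><td>X</td><td>7</td></tr></tbody></table></td>"
                + "<td><table><tbody><tr><th><b>CAT</b></th></tr><tr><td>A</td><td>1</td></tr></tbody></table></td>"
                + "</tr></tbody></table>";
        Element table = findTableByHeader( Jsoup.parse( html ), "CAT" );
        System.out.println( table );
    }
}
```

Because the search starts at the header cell rather than at the tables, the depth of the layout tables around it should not matter. Whether a pure selector expression can do the same is still what I am asking.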
HTML:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8" />
    <title>title</title>
</head>
<body>
    <!-- page content -->
    <table> <!--Outer table. Do not want this.-->
        <tbody>
            <tr>
                <td>
                    <table>
                        <tbody>
                            <tr>
                                <th><b>DOG</b></th> <!-- DOG in header -->
                            </tr>
                            <tr>
                                <td>X</td>
                                <td>7</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td>
                    <table> <!-- I want this table because it contains a header ("th") displaying the value "CAT". -->
                        <tbody>
                            <tr>
                                <th><b>CAT</b></th> <!-- CAT in header -->
                            </tr>
                            <tr>
                                <td>A</td>
                                <td>1</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
            </tr>
        </tbody>
    </table>
</body>
</html>
I also tried the following Java app with jsoup version 1.7.3.
package com.example.jsoupexperiment;

import java.io.InputStream;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * PURPOSE To test parsing of nested tables using the "jsoup" library, as
 * discussed on this StackOverflow.com question:
 * http://stackoverflow.com/q/24719049/642706
 * Titled: Extract a table whose `th` header displays a certain value, using jsoup
 */
public class ParseNestedTables
{
    public static void main( String[] args )
    {
        System.out.println( "Running main method of ParseNestedTables class." );

        InputStream stream = ParseNestedTables.class.getResourceAsStream( "/bogus.html" );
        Scanner scan = new Scanner( stream );
        StringBuilder sb = new StringBuilder();
        while ( scan.hasNextLine() ) {
            sb.append( scan.nextLine() + "\n" );
        }
        // System.out.println(sb.toString());

        Document document = Jsoup.parse( sb.toString() );
        Elements elements = document.select( "table:eq(0):has(th:contains(CAT))" );
        int countElements = elements.size(); // Hoping for 1, but getting 2.
        System.out.println( "Found " + countElements + " elements. Dumping… \n\n" );
        for ( Element element : elements ) {
            System.out.println( "Element…\n" + element.toString() + "\n\n" );
        }
    }
}
But it returns two elements rather than one:
- The outer table containing the desired table.
- The desired table.
Another problem is that, while I don't exactly understand the eq selector's behavior, if it merely picks among elements that are siblings at the same level of the hierarchy, then this would not be a correct answer even if it worked in this example. In my real application, the tables can be nested arbitrarily deep within any number of other tables. Those other tables exist for page layout and have no direct logical connection to the tables I want.
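To check that suspicion about eq, a small experiment like the one below could count how many tables each selector matches. It is a sketch only: the class name EqSelectorCheck and the helper countMatches are hypothetical, and the HTML is a condensed copy of the sample above. The jsoup selector documentation describes :eq(n) as matching elements whose sibling index equals n, so every table that is the first element child of its parent would be expected to match table:eq(0), whatever its nesting depth.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class EqSelectorCheck
{
    // Hypothetical helper (not a jsoup API): parses the HTML and
    // counts the elements matched by the given selector.
    public static int countMatches( String html, String selector )
    {
        Document document = Jsoup.parse( html );
        return document.select( selector ).size();
    }

    public static void main( String[] args )
    {
        // Condensed copy of the sample: one outer table holding a DOG table and a CAT table.
        String html = "<table><tbody><tr>"
                + "<td><table><tbody><tr><th><b>DOG</b></th></tr></tbody></table></td>"
                + "<td><table><tbody><tr><th><b>CAT</b></th></tr></tbody></table></td>"
                + "</tr></tbody></table>";

        // All tables in the document.
        System.out.println( "table: " + countMatches( html, "table" ) );
        // Tables that are the first element child of their parent.
        System.out.println( "table:eq(0): " + countMatches( html, "table:eq(0)" ) );
        // The selector from my app above.
        System.out.println( "with :has: " + countMatches( html, "table:eq(0):has(th:contains(CAT))" ) );
    }
}
```

If the first two counts come out equal, eq(0) is not narrowing anything in this document, which would explain why my app still gets the outer table along with the one I want.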