When working with a gnarly mess of nested tables, how do we extract a particular one using jsoup?

Take for example the HTML below, a pile of tables. Scan down in the second half for the two key tables, each with a th cell displaying either DOG or CAT.

Sometimes I want the dog-table, sometimes the cat-table. There could be a dozen of these ("BIRD", "MOUSE", "HAMSTER", and so on). The cat-table might be nested deeper than the dog-table. So I cannot use any tricks about "first" or "last". I have to look at the th cell’s value, and then fetch the immediate containing table.

My first attempt with jsoup used this selector:

 Elements elements = document.select( "table:has(tbody > tr > th  > b:containsOwn(CAT))" );

With that line I get two elements rather than one:

  • The table I want.
  • The outer table containing the table I want.

At this point my workaround is to examine the length, and go with the shorter one. But there must be a better way.

HTML:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8" />
        <title>title</title>
    </head>
    <body>
        <!-- page content -->
        <table>  <!--Outer table. Do not want this.-->
            <tbody>
                <tr>
                    <td>

                        <table>
                            <tbody>
                                <tr>
                                    <th><b>DOG</b></th> <!-- DOG in header -->
                                </tr>
                                <tr>
                                    <td>X</td>
                                    <td>7</td>
                                </tr>
                            </tbody>
                        </table>

                    </td>
                    <td>

                        <table> <!-- I want this table because it contains a header ("th") displaying the value "CAT". -->
                            <tbody>
                                <tr>
                                    <th><b>CAT</b></th>  <!-- CAT in header -->
                                </tr>
                                <tr>
                                    <td>A</td>
                                    <td>1</td>
                                </tr>
                            </tbody>
                        </table>

                    </td>
                </tr>
            </tbody>
        </table>
    </body>
</html>

I also tried the following Java app with jsoup version 1.7.3.

package com.example.jsoupexperiment;

import java.io.InputStream;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * PURPOSE To test parsing of nested tables using the "jsoup" library, as 
 * discussed on this StackOverflow.com question:
 * http://stackoverflow.com/q/24719049/642706
 * Titled: Extract a table whose `th` header displays a certain value, using jsoup
 */
public class ParseNestedTables
{
    public static void main( String[] args )
    {
        System.out.println( "Running main method of ParseNestedTables class." );
        InputStream stream = ParseNestedTables.class.getResourceAsStream( "/bogus.html" );
        Scanner scan = new Scanner( stream );
        StringBuilder sb = new StringBuilder();
        while ( scan.hasNextLine() ) {
            sb.append( scan.nextLine() + "\n" );
        }
        // System.out.println(sb.toString());
        Document document = Jsoup.parse( sb.toString() );
        Elements elements = document.select( "table:eq(0):has(th:contains(CAT))" );
        int countElements = elements.size(); // Hoping for 1, but getting 2.
        System.out.println( "Found " + countElements + " elements. Dumping… \n\n" );

        for ( Element element : elements ) {
            System.out.println( "Element…\n" + element.toString() + "\n\n" );
        }

    }
}

But it returns two elements rather than one:

  1. The outer table containing the desired table.
  2. The desired table.

Another problem: I don't fully understand the eq selector’s behavior, but if it merely picks among sibling elements at the same level of the hierarchy, then it would not be a correct answer even if it happened to work in this example. In my real application, the tables can be nested arbitrarily deep inside any number of other tables. Those other tables exist only for page layout and have no logical connection to the tables I want.

Basil Bourque
  • `document.select("table:eq(0):has(th:contains(CAT))");` perhaps? – Hovercraft Full Of Eels Jul 13 '14 at 02:49
  • @HovercraftFullOfEels Nope, that line gives same results (two elements, both outer table and desired table). Thanks for trying. – Basil Bourque Jul 13 '14 at 02:55
  • No problem. Note that I did edit my comment. – Hovercraft Full Of Eels Jul 13 '14 at 02:58
  • @HovercraftFullOfEels So noted. I tried both `"table:lt(1):has(th:contains(CAT))"` and `"table:eq(0):has(th:contains(CAT))"`. Those should have the correct meaning? Anyways, both have same result, both the outer and inner table are returned, as a pair of elements. – Basil Bourque Jul 13 '14 at 03:04
  • Is it a requirement that this be solved with a selector? If so, I think you might not be able to get an answer. Refer to http://stackoverflow.com/questions/1520429/css-3-content-selector . According to that answer, there is no CSS selector that can select based on text. – hutingung Jul 14 '14 at 07:08
  • @hutingung I'm looking for any solution that can find a `table` whose `th` cell contains desired text, while avoiding any number of outer arbitrarily-nesting (page-layout) tables. – Basil Bourque Jul 14 '14 at 07:28
  • Jsoup uses CSS selectors to select elements, and those are limited when it comes to selecting on text content; refer to my previous comment. I think a better tool for this is XPath rather than jsoup, for example the XPath selector //th/b[text()='CAT']. Unfortunately jsoup does not support XPath yet: http://stackoverflow.com/questions/7085539/does-jsoup-support-xpath . – hutingung Jul 14 '14 at 08:00
  • I reread the selector documentation and there is a contains implementation after all. I posted an answer. – hutingung Jul 14 '14 at 08:25
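
The XPath idea raised in the comments can be sketched without jsoup, using the XPath engine that ships with the JDK. This is only a rough illustration under some loud assumptions: the markup must be well-formed XML (real-world HTML often is not), the sample string is a cut-down version of the question's HTML, and the id attributes and the names NearestTableXPath and tableIdsFor are hypothetical, added only so the result can be identified. The step ancestor::table[1] selects the nearest enclosing table of each matching b element, which is what keeps the outer layout table out of the result.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NearestTableXPath {

    // Cut-down, well-formed version of the question's HTML.
    // The id attributes are hypothetical; they only label the tables for this demo.
    static final String SAMPLE =
          "<table id='outer'><tbody><tr>"
        + "<td><table id='dog'><tbody><tr><th><b>DOG</b></th></tr></tbody></table></td>"
        + "<td><table id='cat'><tbody><tr><th><b>CAT</b></th></tr></tbody></table></td>"
        + "</tr></tbody></table>";

    // Returns the id of the nearest enclosing table for every th > b whose text equals label.
    // Assumes label contains no single quote, since it is pasted into the XPath expression.
    static List<String> tableIdsFor(String xhtml, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // ancestor::table[1] is the closest table ancestor, not the outermost one.
            String query = "//th/b[text()='" + label + "']/ancestor::table[1]";
            NodeList tables = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODESET);

            List<String> ids = new ArrayList<>();
            for (int i = 0; i < tables.getLength(); i++) {
                ids.add(((Element) tables.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tableIdsFor(SAMPLE, "CAT")); // prints [cat]
    }
}
```

Because the XML parser rejects markup that is not well-formed, this would only apply to pages that can first be cleaned into valid XHTML; that cleaning step is outside the scope of the sketch.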

4 Answers

Workaround: Find Target Value, Go Up The Hierarchy

Another workaround. Not a true answer as it does not improve on the jsoup selector.

We know which table we want by the value of its th header cell. So find that element, then work backwards. Go up the hierarchy of elements (the DOM tree), past the tr and tbody, until we reach the table. We know this is the direct table owning the target th. We avoid the outer nesting tables.

Key code includes finding the th cell:

Elements elements = document.select( "th > b:containsOwn(CAT)" ); 

…and looping to find each parent:

Element element = elements.first();
while (  ! ( ( element == null ) || ( element.tagName().equalsIgnoreCase( "table" ) ) ) ) {
    element = element.parent();
}

Complete example app:

package com.example.jsoupexperiment;

import java.io.InputStream;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseNestedTables
{
    public static void main ( String[] args )
    {
        System.out.println( "Running main method of ParseNestedTables class." );
        InputStream stream = ParseNestedTables.class.getResourceAsStream( "/bogus.html" );
        Scanner scan = new Scanner( stream );
        StringBuilder sb = new StringBuilder();
        while ( scan.hasNextLine() ) {
            sb.append( scan.nextLine() + "\n" );
        }

        Document document = Jsoup.parse( sb.toString() );
        Elements elements = document.select( "th > b:containsOwn(CAT)" ); // Start by finding the desired table's target "th" element.
        int countElements = elements.size();
        switch ( countElements ) {
            case 0:
                System.out.println( "ERROR: Found no elements." );
                break;
            case 1:
                System.out.println( "GOOD: Found 1 element." );
                Element element = elements.first();

                // Loop up the hierarchy of elements (the DOM tree) until we find our desired "table" element or until we get a null.
                while (  ! ( ( element == null ) || ( element.tagName().equalsIgnoreCase( "table" ) ) ) ) {
                    element = element.parent();
                }

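                // The loop above ends with null if no enclosing table exists, so guard before printing.
                if ( element == null ) {
                    System.out.println( "ERROR: Found no enclosing table." );
                } else {
                    System.out.println( "Found Element:\n" + element.toString() );
                }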
                break;
            default:
                System.out.println( "ERROR: Found multiple elements: " + countElements );
                break;
        }
    }
}
Basil Bourque

Basically I used two selectors: document.select("table table") to select the nested tables, and element.select("th b:contains(CAT)").size() > 0 to check whether a given table has a th containing CAT.

package com.example.jsoupexperiment;

import java.io.IOException;
import java.io.InputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * PURPOSE To test parsing of nested tables using the "jsoup" library, as
 * discussed on this StackOverflow.com question:
 * http://stackoverflow.com/q/24719049/642706 Titled: Extract a table whose `th`
 * header displays a certain value, using jsoup
 */
public class ParseNestedTables2 {
    public static void main(String[] args) throws IOException {
        System.out.println("Running main method of ParseNestedTables2 class.");
        InputStream stream = ParseNestedTables2.class
                .getResourceAsStream("/bogus.html");
        Document document = Jsoup.parse(stream, "UTF-8", "http://example.com");
        Elements elements = document.select("table table");
        for (Element element : elements) {
            if (element.select("th b:contains(CAT)").size() > 0) {
                System.out
                        .println("Table whose th contains the selected text (CAT):");
                System.out.println(element);
            }
        }
    }

}

*I also refactored the code a bit so that jsoup parses directly from the input stream.

hutingung
  • Thanks for trying, but you took my example HTML too literally. The problem stems from using HTML *tables for layout* of content that includes desired *data tables*. So the nesting is arbitrary and capricious. Your query "table table" assumes the desired table is nested, which may not be so. See [my own workaround answer](http://stackoverflow.com/a/24731511/642706) which is similar to yours but makes fewer assumptions about the page: Find the desired cell element of the desired table, then use calls to `.parent` to find the first enveloping table. But good answer; SO needs more jsoup examples. – Basil Bourque Jul 23 '14 at 01:42
  • Yes, you are right. Your answer is more dynamic. But I would choose to implement jsoup per site (or per page). As far as I know, on real-world sites the page structure is quite consistent. (*I have used jsoup extensively as a transformer (proxy) to consolidate search results from different sites.) – hutingung Jul 23 '14 at 02:20

I'm not sure what is going wrong, as this seemed to work fine for me:

import java.util.Scanner;    
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ThHeaderTest {
   public static void main(String[] args) {
      String resource = "thHeader.txt";
      Scanner scan = new Scanner(ThHeaderTest.class.getResourceAsStream(resource));
      StringBuilder sb = new StringBuilder();
      while (scan.hasNextLine()) {
         sb.append(scan.nextLine() + "\n");
      }
      // System.out.println(sb.toString());
      Document document = Jsoup.parse(sb.toString());
      Elements elements = document.select("table:eq(0):has(th:contains(CAT))");
      System.out.println(elements);
   }
}
Hovercraft Full Of Eels
  • I tried almost exactly this code. Fails for me, returning two elements rather than one. I posted my code in the Question. I also posted in the Question an exact copy of the HTML used in my experiment with this code. Should be semantically the same as before, but for the sake of thoroughness I replaced the previous HTML. – Basil Bourque Jul 14 '14 at 06:20

Workaround: Use Shorter Element

Not a true answer as it does not improve on the jsoup selector. But it is a practical workaround.

Since the problem is that the outer nesting table is returned along with the desired table, we know logically that the desired table’s HTML will be shorter than the outer table’s HTML.

So the workaround is comparing the length of each found Element. Use the Element with the shortest HTML length.
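
The length comparison described above can be sketched roughly as follows. This is a minimal illustration, not code from the original answer: it assumes each matched element has already been rendered to a string (with jsoup that would presumably be each Element's outerHtml(), which is not exercised here), and the names ShortestMatch and shortest are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ShortestMatch {

    // Returns the shortest candidate string, or null if the list is empty.
    // Assumes every candidate either is the wanted table or encloses it,
    // so the shortest one must be the innermost table.
    static String shortest(List<String> candidates) {
        return candidates.stream()
                .min(Comparator.comparingInt(String::length))
                .orElse(null);
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the HTML of the two elements the selector returns.
        String inner = "<table><tr><th><b>CAT</b></th></tr></table>";
        String outer = "<table><tr><td>" + inner + "</td></tr></table>";

        String chosen = shortest(Arrays.asList(outer, inner));
        System.out.println(chosen.equals(inner)); // prints true
    }
}
```

The approach only holds while all matches are nested inside one another. If two separate tables on the page both carried a CAT header, the shortest string would simply be whichever table happened to have less markup.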

Basil Bourque