I need some help parsing this HTML with JSoup. I'm trying to get the data values from each column in the table. I've been looking at the JSoup docs to figure out exactly what I need to do, but I'm still not sure. The website appears to use a combination of CSS and inline formatting, much of which could be moved into CSS to reduce the page size.

This is a small snippet of the HTML file (the full file is almost 5 MB).

<html>

<head>
</head>

<body>
  <table>
    <tr>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>
        <div id="plyrRankings" style="overflow: scroll; overflow-x: hidden;">
          <table id="u868top" width="868" bgcolor="#C8C8C8" cellspacing="0" cellpadding="0" border="0">
            <tr>
              <td class="legend titlesmall" bgcolor="#000000" align="left" height="60">#</td>
            </tr>
          </table>
          <table id="u868" width="868" bgcolor="#C8C8C8" cellspacing="0" cellpadding="0" border="0">
            <caption style="display:none">
              Live ATP Ranking
            </caption>
            <thead>
              <tr class="legend" bgcolor="#000000">
                <td colspan="14" height="4"></td>
              </tr>
              <tr>
                <td colspan="14" height="1"></td>
              </tr>
              <tr class="tbhead">
                <td><b>#</b></td>
                <td><b>CH</b></td>
                <td><b>Player Name</b></td>
                <td><b>Age</b></td>
                <td><b>Ctry</b></td>
                <td class="title" align="left" colspan="1" height="30" width="50" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByPosition();underlineHeaderColumn(5);"><b>Pts</b></td>
                <td class="title" align="center" colspan="2" height="30" width="30" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByColumn(3);underlineHeaderColumn(6);"><b>+/-</b></td>
                <td class="title hdcol" align="center" colspan="1" height="30" width="320" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByColumn(4);underlineHeaderColumn(7);"><b>Current Tournament</b></td>
                <td class="title hdcol" align="center" height="30" width="320"><b>Previous Tournaments</b></td>
                <td class="title shcol" align="center" height="30" width="320"><b>Current Tournament</b></td>
                <td><b>Next Pts</b></td>
                <td><b>Max Pts</b></td>
              </tr>
              <tr class="tbhead">
                <td height="1" width="400" colspan="3"></td>
                <td height="1" align="right" width="120" colspan="11"></td>
              </tr>
              <tr>
                <td></td>
              </tr>
            </thead>
            <tbody>
              <tr bgColor="white" class="ESP">
                <td width=20 height=30>&nbsp;1&nbsp;</td>
                <td width=20><b class="smalltxt">&nbsp;&nbsp;</b><b class="chigh">&nbsp;CH&nbsp;</b><b class="smalltxt">&nbsp;&nbsp;&nbsp;</b></td>
                <td>
                  <div class="spr esp"></div>
                </td>
                <td width=150>Rafael Nadal</td>
                <td width=50>31<span style="font-size:66%">.6</span></td>
                <td width=80>ESP<span style="font-size:66%">1</span></td>
                <td width=50>9580</td>
                <td align="center">-</td>
                <td align="center"><b class="smallred">-1020</b></td>
                <td class="hdcol" align="center" width=320>Australian Open R16<br> (R32&nbsp;
                  <a href="" onclick="playVideo('6i9o76bE4vM' );return false;">&nbsp;<img width=20 src="/youtube-logo-play-icon.png" style="vertical-align:middle;margin-top:-2px";></a>)</td>
                <td class="hdcol" align="center" width=320>-</td>
                <td class="shcol" align="center" width=320>Australian Open R16<br> (R32&nbsp;
                  <a href="" onclick="playVideo('6i9o76bE4vM' );return false;">&nbsp;<img width=20 src="/youtube-logo-play-icon.png" style="vertical-align:middle;margin-top:-2px";></a>)</td>
                <td width=50>9760</td>
                <td width=50>11400</td>
              </tr>
              <tr>
                <td colspan=14 height=1></td>
              </tr>
            </tbody>
          </table>
        </div>
      </td>
    </tr>
  </table>
</body>

</html>

Here is my Parse class:

public static class Parse {

    public static ArrayList<Player> playerList(Document doc) {

        ArrayList<Player> players = new ArrayList<>();

        try {
            Elements trs = doc.select("tbody tr");                

            for (Element tr : trs) {
                Elements tds = tr.getElementsByTag("td");
                Element td = tds.first();
                System.out.println("Blog: " + td.text());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        return players;
    }
}

Update: I've updated the source code to show the structure of the HTML more accurately. I had assumed it was a given that tbody would be inside a table element. I guess I was wrong, sorry about that.

Ziggy
To select and output CSS elements inside table elements, refer to this question, and please mark it if it helps you: https://stackoverflow.com/questions/29326901/converting-window-openhyperlink-javascript-code-to-pure-absolute-url-with-java – PHPFan Jan 21 '18 at 12:46

2 Answers

I had some difficulty parsing the snippet you supplied because the table element tag was missing, but once I added it I was able to get the text in each column using the following logic:

public static void main(String[] args) {

    String html = "<html> <head></head> <body> <table>\n" +
            "<tbody>\n" +
            "<tr bgColor=\"white\" class=\"ESP\">\n" +
            "    <td width=20 height=30>&nbsp;1&nbsp;</td>\n" +
            "    <td width=20><b class=\"smalltxt\">&nbsp;&nbsp;</b><b class=\"chigh\">&nbsp;CH&nbsp;</b><b class=\"smalltxt\">&nbsp;&nbsp;&nbsp;</b></td> <td><div class=\"spr esp\"></div></td> \n" +
            "    <td width=150>Rafael Nadal</td> \n" +
            "    <td width=50>31<span style=\"font-size:66%\">.6</span></td> \n" +
            "    <td width=80>ESP<span style=\"font-size:66%\">1</span></td> \n" +
            "    <td width=50>9580</td> <td align=\"center\">-</td> \n" +
            "    <td align=\"center\"><b class=\"smallred\">-1020</b></td> \n" +
            "    <td class=\"hdcol\" align=\"center\" width=320>Australian Open R16<br> (R32&nbsp;<a href=\"\" onclick=\"playVideo('6i9o76bE4vM' );return false;\" >&nbsp;<img width=20 src=\"/youtube-logo-play-icon.png\" style=\"vertical-align:middle;margin-top:-2px\";></a>)</td> \n" +
            "    <td class=\"hdcol\" align=\"center\" width=320>-</td> <td class=\"shcol\" align=\"center\" width=320>Australian Open R16<br> (R32&nbsp;<a href=\"\" onclick=\"playVideo('6i9o76bE4vM' );return false;\" >&nbsp;<img width=20 src=\"/youtube-logo-play-icon.png\" style=\"vertical-align:middle;margin-top:-2px\";></a>)</td> \n" +
            "    <td width=50>9760</td> <td width=50>11400</td> \n" +
            "</tr>\n" +
            "</tbody>\n" +
            "</table>\n" +
            "</body>\n" +
            "</html>";

    Document document = Jsoup.parse(html);

    Elements data = document.select("body > table > tbody > tr > td");

    for (Element value : data) {
        System.out.println(value.text());
    }
}
Emma

This code will successfully read the table contents from the HTML provided in your question:

String html = "your html";

Document doc = Jsoup.parse(html);

try {
    // select the table
    Elements table = doc.select("table");
    // select all rows in the table
    Elements trs = table.select("tr");

    for (Element tr : trs) {
        // select all cells in this row
        Elements tds = tr.getElementsByTag("td");
        for (Element td : tds) {
            // print out the cell content
            System.out.println(td.text());
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

Given the HTML provided in your question, this code will print:

 1 
   CH    

Rafael Nadal
31.6
ESP1
9580
-
-1020
Australian Open R16 (R32  )
-
Australian Open R16 (R32  )
9760
11400
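
As a possible next step toward the ArrayList<Player> from the question, the cell texts printed above could be mapped to typed fields. The following is only a minimal sketch under stated assumptions: the question does not show the Player class, so PlayerRow, PlayerRowMapper, clean and fromCells are hypothetical names, and the column positions (0 = rank, 3 = name, 4 = age, 5 = country, 6 = points) are read off the output listed above. It uses plain Java only; the list of strings stands in for the td.text() values of a single row.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: turns the cell texts of one table row into typed fields.
final class PlayerRowMapper {

    // Stand-in for the question's Player class, whose fields are not shown.
    static final class PlayerRow {
        final int rank;
        final String name;
        final double age;
        final String country;
        final int points;

        PlayerRow(int rank, String name, double age, String country, int points) {
            this.rank = rank;
            this.name = name;
            this.age = age;
            this.country = country;
            this.points = points;
        }
    }

    // Replace non-breaking spaces (left over from &nbsp;) and trim the padding.
    static String clean(String raw) {
        return raw.replace('\u00A0', ' ').trim();
    }

    // Returns empty for rows with fewer than 14 cells, such as the spacer rows.
    static Optional<PlayerRow> fromCells(List<String> cells) {
        if (cells.size() < 14) {
            return Optional.empty();
        }
        int rank = Integer.parseInt(clean(cells.get(0)));
        String name = clean(cells.get(3));
        double age = Double.parseDouble(clean(cells.get(4)));
        // "ESP1" is the country code followed by a number; drop trailing digits.
        String country = clean(cells.get(5)).replaceAll("\\d+$", "");
        int points = Integer.parseInt(clean(cells.get(6)));
        return Optional.of(new PlayerRow(rank, name, age, country, points));
    }

    public static void main(String[] args) {
        List<String> cells = Arrays.asList(
                "\u00A01\u00A0", "\u00A0CH\u00A0", "", "Rafael Nadal", "31.6", "ESP1",
                "9580", "-", "-1020", "Australian Open R16 (R32 )", "-",
                "Australian Open R16 (R32 )", "9760", "11400");
        // Prints: 1 Rafael Nadal ESP 9580
        fromCells(cells).ifPresent(p ->
                System.out.println(p.rank + " " + p.name + " " + p.country + " " + p.points));
    }
}
```

In the inner loop above, the td.text() values of each row could be collected into a list and passed to fromCells; rows that come back empty are simply skipped. A row whose first cell is not numeric would make Integer.parseInt throw, so real data may need extra checks.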
glytching