I need some help parsing this HTML with jsoup. I'm trying to get the data values from each column in the table. I've been reading the jsoup docs to figure out exactly what I need to do, but I'm still not sure. The website uses a combination of CSS and inline formatting, much of which could be moved into CSS to reduce the page size.
This is a small snippet of the HTML file (the full file is almost 5 MB).
<html>
<head>
</head>
<body>
<table>
<tr>
<td> </td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td>
<div id="plyrRankings" style="overflow: scroll; overflow-x: hidden;">
<table id="u868top" width="868" bgcolor="#C8C8C8" cellspacing="0" cellpadding="0" border="0">
<tr>
<td class="legend titlesmall" bgcolor="#000000" align="left" height="60">#</td>
</tr>
</table>
<table id="u868" width="868" bgcolor="#C8C8C8" cellspacing="0" cellpadding="0" border="0">
<caption style="display:none">
Live ATP Ranking
</caption>
<thead>
<tr class="legend" bgcolor="#000000">
<td colspan="14" height="4"></td>
</tr>
<tr>
<td colspan="14" height="1"></td>
</tr>
<tr class="tbhead">
<td><b>#</b></td>
<td><b>CH</b></td>
<td><b>Player Name</b></td>
<td><b>Age</b></td>
<td><b>Ctry</b></td>
<td class="title" align="left" colspan="1" height="30" width="50" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByPosition();underlineHeaderColumn(5);"><b>Pts</b></td>
<td class="title" align="center" colspan="2" height="30" width="30" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByColumn(3);underlineHeaderColumn(6);"><b>+/-</b></td>
<td class="title hdcol" align="center" colspan="1" height="30" width="320" onMouseOver="this.className='title2';this.style.cursor='Pointer';" onMouseOut="this.className='title';this.style.cursor='Default'" onclick="sortByColumn(4);underlineHeaderColumn(7);"><b>Current Tournament</b></td>
<td class="title hdcol" align="center" height="30" width="320"><b>Previous Tournaments</b></td>
<td class="title shcol" align="center" height="30" width="320"><b>Current Tournament</b></td>
<td><b>Next Pts</b></td>
<td><b>Max Pts</b></td>
</tr>
<tr class="tbhead">
<td height="1" width="400" colspan="3"></td>
<td height="1" align="right" width="120" colspan="11"></td>
</tr>
<tr>
<td></td>
</tr>
</thead>
<tbody>
<tr bgColor="white" class="ESP">
<td width=20 height=30> 1 </td>
<td width=20><b class="smalltxt"> </b><b class="chigh"> CH </b><b class="smalltxt"> </b></td>
<td>
<div class="spr esp"></div>
</td>
<td width=150>Rafael Nadal</td>
<td width=50>31<span style="font-size:66%">.6</span></td>
<td width=80>ESP<span style="font-size:66%">1</span></td>
<td width=50>9580</td>
<td align="center">-</td>
<td align="center"><b class="smallred">-1020</b></td>
<td class="hdcol" align="center" width=320>Australian Open R16<br> (R32
<a href="" onclick="playVideo('6i9o76bE4vM' );return false;"> <img width=20 src="/youtube-logo-play-icon.png" style="vertical-align:middle;margin-top:-2px";></a>)</td>
<td class="hdcol" align="center" width=320>-</td>
<td class="shcol" align="center" width=320>Australian Open R16<br> (R32
<a href="" onclick="playVideo('6i9o76bE4vM' );return false;"> <img width=20 src="/youtube-logo-play-icon.png" style="vertical-align:middle;margin-top:-2px";></a>)</td>
<td width=50>9760</td>
<td width=50>11400</td>
</tr>
<tr>
<td colspan=14 height=1></td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</table>
</body>
</html>
Here is my Parse class:
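// Imports (these go at the top of the enclosing file)
import java.util.ArrayList;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static class Parse {
    public static ArrayList<Player> playerList(Document doc) {
        ArrayList<Player> players = new ArrayList<>();
        try {
            // Matches every row under any tbody, not just the ranking table's rows
            Elements trs = doc.select("tbody tr");
            for (Element tr : trs) {
                Elements tds = tr.getElementsByTag("td");
                Element td = tds.first();
                if (td == null) {
                    continue; // row without any cells
                }
                System.out.println("Blog: " + td.text());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return players;
    }
}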
Update: I've updated the HTML snippet to show the structure of the page more accurately. I had assumed it was a given that tbody would be inside a table element. I guess I was wrong; sorry about that.
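For illustration, here is a minimal sketch of the kind of per-column extraction described above. It assumes jsoup is on the classpath and that the full page keeps the structure of the snippet, with the player rows inside `table#u868`. The class name `RankingRows` and the method `cellTexts` are hypothetical names chosen for this example. Mapping the cells onto `Player` fields is left out because the `Player` class is not shown.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper, not part of the original Parse class.
public class RankingRows {

    /** Returns the cell texts of every data row in table#u868, one list per row. */
    public static List<List<String>> cellTexts(Document doc) {
        List<List<String>> rows = new ArrayList<>();
        // Child combinators (>) restrict the match to the ranking table's own
        // tbody, so rows of the outer layout table and of the thead are ignored.
        for (Element tr : doc.select("table#u868 > tbody > tr")) {
            Elements tds = tr.children();
            if (tds.size() < 2) {
                continue; // spacer rows hold a single colspan cell
            }
            List<String> cells = new ArrayList<>();
            for (Element td : tds) {
                cells.add(td.text()); // whitespace-normalized text, "" for empty cells
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Cut-down stand-in for the real page: a layout table wrapping table#u868.
        String html = "<table><tr><td>"
                + "<table id=\"u868\"><thead><tr><td>#</td><td>Player Name</td></tr></thead>"
                + "<tbody><tr><td> 1 </td><td>Rafael Nadal</td><td>9580</td></tr>"
                + "<tr><td colspan=3></td></tr></tbody></table>"
                + "</td></tr></table>";
        for (List<String> row : cellTexts(Jsoup.parse(html))) {
            System.out.println(row);
        }
    }
}
```

With the real page, the position of each value in the returned list follows the order of the `td` elements in the row, so the player name would be the fourth entry and the points the seventh, assuming every row matches the one in the snippet.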