I'm working on a project in Java 1.5 that moves player data from the NFL's web pages into a personal data store. I have the page source as a String and am parsing it with a regular expression to pull out the data I want. I can extract the first block of regularly formatted player information, but I'm struggling to write a pattern that accommodates the irregular whitespace and line breaks in the later cells. The commented-out lines begin where it stops parsing correctly.
pattern = Pattern.compile(
sTag + "(.*?)" + eTag + "\n"//position 1-group
+sTag + "(.*?)" + eTag + "\n" //number 2
+ "<td><a href=\"(.*?)/profile\">(.*?)</a>" + eTag + "\n" //name 4 (3 not used)
+sTag + "(.*?)" + eTag + "\n" //active status 5
// +"(.*?)" //6
// +sTag + "(.*?)" + eTag + "\n" //tackles 7
// +"(.*?)" //8
// +sTag + "(.*?)" + eTag //sacks 9
// +"(.*?)" //10
// +sTag + "(.*?)" + eTag //ff 11 (not used)
// +"(.*?)" //12
// +sTag + "(.*?)" + eTag //int 13
);
The HTML data I'm trying to parse is formatted as follows:
<td class="tbdy1"><a href="/teams/atlantafalcons/profile?team=ATL">ATL</a></td></tr>
<tr class="even">
<td class="tbdy">SS</td>
<td class="tbdy">20</td>
<td><a href="/player/willallen/2506088/profile">Allen, Will</a></td>
<td class="tbdy">ACT</td>
<td class="ra">
TCKL
</td>
<td class="tbdy">36</td>
<td class="ra">
SCK
</td>
<td class="tbdy">0.0</td>
<td class="ra">
FF
</td>
<td class="tbdy">1</td>
<td class="ra">
INT
</td>
<td class="tbdy">--</td>
Any help?
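
One direction that might work, sketched under some assumptions: by default `.` does not match line breaks, so a bare `(.*?)` cannot span the multi-line `<td class="ra">` label cells, and a literal `"\n"` between cells breaks on any indentation or `\r\n` endings. Replacing the literal newline with `\\s*` and skipping each label cell explicitly sidesteps both problems. In the sketch below, the values of `S_TAG` and `E_TAG` are guesses at what `sTag` and `eTag` hold (the question does not show them), and `PlayerRowParser`, `parseRows`, and `SAMPLE` are hypothetical names used only for illustration. The group numbering also differs from the comments in the original pattern, because the label cells are not captured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlayerRowParser {
    // Assumed values: the question does not show how sTag and eTag are defined.
    static final String S_TAG = "<td class=\"tbdy\">";
    static final String E_TAG = "</td>";

    // Any run of spaces, tabs, and line breaks (possibly empty) between cells.
    static final String WS = "\\s*";

    // One data cell, captured, followed by optional whitespace.
    static final String CELL = S_TAG + "(.*?)" + E_TAG + WS;

    // One label cell such as TCKL or SCK, spread over several lines; not captured.
    static final String LABEL = "<td class=\"ra\">\\s*\\w+\\s*</td>" + WS;

    static final Pattern ROW = Pattern.compile(
          CELL                                                        // 1 position
        + CELL                                                        // 2 number
        + "<td><a href=\"(.*?)/profile\">(.*?)</a>" + E_TAG + WS      // 3 profile path, 4 name
        + CELL                                                        // 5 active status
        + LABEL + CELL                                                // 6 tackles
        + LABEL + CELL                                                // 7 sacks
        + LABEL + CELL                                                // 8 forced fumbles
        + LABEL + CELL                                                // 9 interceptions
    );

    // Sample row modeled on the HTML in the question, with indentation added
    // to show that the pattern tolerates it.
    static final String SAMPLE =
          "<tr class=\"even\">\n"
        + "  <td class=\"tbdy\">SS</td>\n"
        + "  <td class=\"tbdy\">20</td>\n"
        + "  <td><a href=\"/player/willallen/2506088/profile\">Allen, Will</a></td>\n"
        + "  <td class=\"tbdy\">ACT</td>\n"
        + "  <td class=\"ra\">\n    TCKL\n  </td>\n"
        + "  <td class=\"tbdy\">36</td>\n"
        + "  <td class=\"ra\">\n    SCK\n  </td>\n"
        + "  <td class=\"tbdy\">0.0</td>\n"
        + "  <td class=\"ra\">\n    FF\n  </td>\n"
        + "  <td class=\"tbdy\">1</td>\n"
        + "  <td class=\"ra\">\n    INT\n  </td>\n"
        + "  <td class=\"tbdy\">--</td>\n";

    // Returns one String[9] per matched player row; index i holds group i + 1.
    static List<String[]> parseRows(String html) {
        List<String[]> rows = new ArrayList<String[]>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            String[] fields = new String[m.groupCount()];
            for (int i = 0; i < fields.length; i++) {
                fields[i] = m.group(i + 1);
            }
            rows.add(fields);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : parseRows(SAMPLE)) {
            StringBuilder line = new StringBuilder();
            for (String field : row) {
                line.append('[').append(field).append(']');
            }
            System.out.println(line);
        }
    }
}
```

This keeps the lazy `(.*?)` groups from the original pattern. If the real pages ever put several cells on one line, `([^<]*)` may be a safer capture, and for anything beyond a fixed, well-known layout an HTML parser is usually more robust than a regular expression.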