So I'm working on a project in Java 1.5 that moves player data from the NFL's web pages into a personal data store. I have the page's source converted to a string and am parsing it for the data I want to pull out. I can extract the first chunk of regularly formatted player information, but I'm struggling to write a pattern that accommodates some unusually structured whitespace. The commented-out lines begin where it stops parsing correctly.

pattern = Pattern.compile(
                        sTag + "(.*?)" + eTag + "\n"//position 1-group
                        +sTag + "(.*?)" + eTag + "\n" //number 2
                        + "<td><a href=\"(.*?)/profile\">(.*?)</a>" + eTag + "\n" //name 4 (3 not used)
                        +sTag + "(.*?)" + eTag + "\n" //active status 5
//                      +"(.*?)" //6
//                      +sTag + "(.*?)" + eTag + "\n" //tackles 7
//                      +"(.*?)" //8
//                      +sTag + "(.*?)" + eTag //sacks 9
//                      +"(.*?)" //10
//                      +sTag + "(.*?)" + eTag //ff 11 (not used)
//                      +"(.*?)" //12
//                      +sTag + "(.*?)" + eTag //int 13
                        ); 

The HTML data I'm trying to parse is formatted as follows:

<td class="tbdy1"><a href="/teams/atlantafalcons/profile?team=ATL">ATL</a></td></tr>
<tr class="even">
<td class="tbdy">SS</td>
<td class="tbdy">20</td>
<td><a href="/player/willallen/2506088/profile">Allen, Will</a></td>
<td class="tbdy">ACT</td>
<td class="ra">
                                TCKL
                            </td>
<td class="tbdy">36</td>
<td class="ra">
                                SCK
                            </td>
<td class="tbdy">0.0</td>
<td class="ra">
                                FF
                            </td>
<td class="tbdy">1</td>
<td class="ra">
                                INT
                            </td>
<td class="tbdy">--</td>

Any help?

  • [time to change tack](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Reimeus Apr 13 '15 at 22:42
  • I wish I could! Do you have any suggestions for a different way to attack the problem? I have to parse the code with Java, that's the only requirement as far as I'm aware. – GregorioVasquez Apr 14 '15 at 13:42
  • [jsoup](http://jsoup.org/) – Reimeus Apr 14 '15 at 13:45

1 Answer

After some digging, I decided to approach the problem a different way. The question "Removing whitespace from strings in Java" showed me how to eliminate all of the whitespace, which made pattern matching significantly easier. My final setup ended up looking something like this:

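            // Requires java.util.regex.Pattern and java.util.regex.Matcher.
            // Strip all whitespace (spaces, tabs, newlines) so the cells sit back to back.
            // Note: this also removes spaces inside values, e.g. "Allen, Will" becomes "Allen,Will".
            line = line.replaceAll("\\s", "");
            String sTag = "<tdclass=\"tbdy\">"; // opening tag as it looks after stripping
            String eTag = "</td>";

            Pattern pattern = Pattern.compile(
                    // pattern //stat group#
                    sTag + "(.*?)" + eTag //position 1
                    +sTag + "(.*?)" + eTag //number 2
                    + "<td><ahref=\"(.*?)/profile\">(.*?)</a>" + eTag //name 4 (3 not used)
                    +sTag + "(.*?)" + eTag //status 5
                    +"(.*?)" //6
                    +sTag + "(.*?)" + eTag //tackles 7
                    +"(.*?)" //8
                    +sTag + "(.*?)" + eTag //sacks 9
                    +"(.*?)" //10
                    +sTag + "(.*?)" + eTag //ff 11 (not used)
                    +"(.*?)" //12
                    +sTag + "(.*?)" + eTag //int 13
                    );
            // A Matcher must be created from the pattern and advanced with find()
            // before any group() call; each successful find() is one player row.
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                System.out.println(" " + matcher.group(1) + " " + matcher.group(2) + " " + matcher.group(4)
                        + " " + matcher.group(5) + " " + matcher.group(7) + " " + matcher.group(9)
                        + " " + matcher.group(13));
            }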
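For anyone who wants to try this against the sample row from the question, here is a minimal self-contained sketch of the same strip-then-match idea. The class name `PlayerRowDemo`, the method `parseRow`, and the constant `SAMPLE` are made up for illustration, and the indentation inside the sample cells is shortened; it only assumes the row layout shown in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative, not from the original project.
class PlayerRowDemo {

    // One player row, following the HTML posted in the question.
    static final String SAMPLE =
          "<tr class=\"even\">\n"
        + "<td class=\"tbdy\">SS</td>\n"
        + "<td class=\"tbdy\">20</td>\n"
        + "<td><a href=\"/player/willallen/2506088/profile\">Allen, Will</a></td>\n"
        + "<td class=\"tbdy\">ACT</td>\n"
        + "<td class=\"ra\">\n        TCKL\n    </td>\n"
        + "<td class=\"tbdy\">36</td>\n"
        + "<td class=\"ra\">\n        SCK\n    </td>\n"
        + "<td class=\"tbdy\">0.0</td>\n"
        + "<td class=\"ra\">\n        FF\n    </td>\n"
        + "<td class=\"tbdy\">1</td>\n"
        + "<td class=\"ra\">\n        INT\n    </td>\n"
        + "<td class=\"tbdy\">--</td>\n";

    // Returns {position, number, name, status, tackles, sacks, int},
    // or null if the row does not match.
    static String[] parseRow(String html) {
        // Remove every whitespace character so cells are adjacent.
        String line = html.replaceAll("\\s", "");
        String sTag = "<tdclass=\"tbdy\">";
        String eTag = "</td>";
        Pattern pattern = Pattern.compile(
              sTag + "(.*?)" + eTag                               // 1 position
            + sTag + "(.*?)" + eTag                               // 2 number
            + "<td><ahref=\"(.*?)/profile\">(.*?)</a>" + eTag     // 3 href, 4 name
            + sTag + "(.*?)" + eTag                               // 5 status
            + "(.*?)" + sTag + "(.*?)" + eTag                     // 6 label cell, 7 tackles
            + "(.*?)" + sTag + "(.*?)" + eTag                     // 8 label cell, 9 sacks
            + "(.*?)" + sTag + "(.*?)" + eTag                     // 10 label cell, 11 ff
            + "(.*?)" + sTag + "(.*?)" + eTag);                   // 12 label cell, 13 int
        Matcher m = pattern.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group(1), m.group(2), m.group(4), m.group(5),
            m.group(7), m.group(9), m.group(13)
        };
    }

    public static void main(String[] args) {
        String[] row = parseRow(SAMPLE);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) out.append(' ');
            out.append(row[i]);
        }
        // prints: SS 20 Allen,Will ACT 36 0.0 --
        System.out.println(out);
    }
}
```

One side effect to be aware of: because all whitespace is removed before matching, the name comes back as `Allen,Will` rather than `Allen, Will`, so it may need to be re-split on the comma afterwards.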