Regex-ing around whitespace in Java

Question

So I'm working on a project in Java 1.5 that will move player data from the NFL's web pages into a personal data store. I have the page's source code converted to a string, and am parsing it for the data I want to pull out. I'm able to get the first chunk of regularly formatted player information out, but am struggling to write a pattern that accommodates some unusually structured whitespace. The commented-out lines begin where it stops parsing correctly.

pattern = Pattern.compile(
                        sTag + "(.*?)" + eTag + "\n"//position 1-group
                        +sTag + "(.*?)" + eTag + "\n" //number 2
                        + "<td><a href=\"(.*?)/profile\">(.*?)</a>" + eTag + "\n" //name 4 (3 not used)
                        +sTag + "(.*?)" + eTag + "\n" //active status 5
//                      +"(.*?)" //6
//                      +sTag + "(.*?)" + eTag + "\n" //tackles 7
//                      +"(.*?)" //8
//                      +sTag + "(.*?)" + eTag //sacks 9
//                      +"(.*?)" //10
//                      +sTag + "(.*?)" + eTag //ff 11 (not used)
//                      +"(.*?)" //12
//                      +sTag + "(.*?)" + eTag //int 13
                        );

The HTML data I'm trying to parse is formatted as follows:

<td class="tbdy1"><a href="/teams/atlantafalcons/profile?team=ATL">ATL</a></td></tr>
<tr class="even">
<td class="tbdy">SS</td>
<td class="tbdy">20</td>
<td><a href="/player/willallen/2506088/profile">Allen, Will</a></td>
<td class="tbdy">ACT</td>
<td class="ra">
                                TCKL
                            </td>
<td class="tbdy">36</td>
<td class="ra">
                                SCK
                            </td>
<td class="tbdy">0.0</td>
<td class="ra">
                                FF
                            </td>
<td class="tbdy">1</td>
<td class="ra">
                                INT
                            </td>
<td class="tbdy">--</td>

Any help?

[time to change tack](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Reimeus, Apr 13 '15 at 22:42
I wish I could! Do you have any suggestions for a different way to attack the problem? I have to parse the code with Java, that's the only requirement as far as I'm aware. — GregorioVasquez, Apr 14 '15 at 13:42

score 0 · Answer 1 · edited May 23 '17 at 12:14

After some digging, I decided to approach the problem a different way. The thread "Removing whitespace from strings in Java" showed me how to strip all of the whitespace out of the string, which made the pattern matching significantly easier. My final setup ended up looking something like this:

            // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
            // Strip every whitespace character so each row collapses onto one line.
            line = line.replaceAll("\\s", "");
            // The tags lose their internal spaces too, so the literals below omit them.
            String sTag = "<tdclass=\"tbdy\">";
            String eTag = "</td>";

            Pattern pattern = Pattern.compile(
                    // pattern //stat group#
                    sTag + "(.*?)" + eTag //position 1
                    +sTag + "(.*?)" + eTag //number 2
                    + "<td><ahref=\"(.*?)/profile\">(.*?)</a>" + eTag //name 4 (3 not used)
                    +sTag + "(.*?)" + eTag //status 5
                    +"(.*?)" //6
                    +sTag + "(.*?)" + eTag //tackles 7
                    +"(.*?)" //8
                    +sTag + "(.*?)" + eTag //sacks 9
                    +"(.*?)" //10
                    +sTag + "(.*?)" + eTag //ff 11 (not used)
                    +"(.*?)" //12
                    +sTag + "(.*?)" + eTag //int 13
                    );
            // The groups are only available after a successful find().
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                System.out.println(" " + matcher.group(1) +" "+ matcher.group(2) + " " + matcher.group(4)+" "+ matcher.group(5)+ " " + matcher.group(7)+ " " + matcher.group(9)+ " " + matcher.group(13));
            }
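
One side effect of stripping every whitespace character is that it also changes the data itself ("Allen, Will" becomes "Allen,Will"). An alternative is to leave the input untouched and let the pattern tolerate whitespace with `\s*`. What follows is only a minimal sketch, assuming the row layout shown in the question; the class name `RowParser`, the constants `WS`, `CELL`, `LABEL`, and the `SAMPLE` string are hypothetical names made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowParser {
    // Hypothetical sample: the player row from the question, whitespace intact.
    static final String SAMPLE =
          "<tr class=\"even\">\n"
        + "<td class=\"tbdy\">SS</td>\n"
        + "<td class=\"tbdy\">20</td>\n"
        + "<td><a href=\"/player/willallen/2506088/profile\">Allen, Will</a></td>\n"
        + "<td class=\"tbdy\">ACT</td>\n"
        + "<td class=\"ra\">\n                TCKL\n            </td>\n"
        + "<td class=\"tbdy\">36</td>\n"
        + "<td class=\"ra\">\n                SCK\n            </td>\n"
        + "<td class=\"tbdy\">0.0</td>\n"
        + "<td class=\"ra\">\n                FF\n            </td>\n"
        + "<td class=\"tbdy\">1</td>\n"
        + "<td class=\"ra\">\n                INT\n            </td>\n"
        + "<td class=\"tbdy\">--</td>\n";

    // Optional whitespace, newlines included.
    static final String WS = "\\s*";
    // A data cell whose content is captured, plus any trailing whitespace.
    static final String CELL = "<td class=\"tbdy\">(.*?)</td>" + WS;
    // A label cell (TCKL, SCK, FF, INT); matched but not captured.
    static final String LABEL = "<td class=\"ra\">" + WS + "\\w+" + WS + "</td>" + WS;

    static final Pattern ROW = Pattern.compile(
          CELL                                                  // 1 position
        + CELL                                                  // 2 number
        + "<td><a href=\"(.*?)/profile\">(.*?)</a></td>" + WS   // 3 link (unused), 4 name
        + CELL                                                  // 5 status
        + LABEL + CELL                                          // 6 tackles
        + LABEL + CELL                                          // 7 sacks
        + LABEL + CELL                                          // 8 forced fumbles (unused)
        + LABEL + CELL);                                        // 9 interceptions

    /** Returns position, number, name, status, tackles, sacks, ints; null if no row matches. */
    static String[] parse(String html) {
        Matcher m = ROW.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group(1), m.group(2), m.group(4), m.group(5),
            m.group(6), m.group(7), m.group(9)
        };
    }

    public static void main(String[] args) {
        String[] fields = parse(SAMPLE);
        if (fields == null) {
            System.out.println("no match");
            return;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            out.append(' ').append(fields[i]);
        }
        System.out.println(out);
    }
}
```

Because the label cells are matched without capturing, the group numbers here run 1 to 9 rather than 1 to 13. To read every row on a page, the same `Matcher` could be driven with `while (m.find())` instead of a single `find()`.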

Also, changing Pattern.compile([pattern]) to Pattern.compile([pattern], Pattern.DOTALL) would have done the trick, since DOTALL lets . match line terminators as well. — GregorioVasquez, May 04 '15 at 19:23
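
To illustrate the DOTALL remark: by default `.` does not match line terminators, so a `(.*?)` filler cannot span the multi-line label cells. A small sketch of the difference, assuming a cut-down version of the question's HTML and pattern; the class name `DotallDemo`, the helper `tackles`, and the `SAMPLE` string are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {
    // Hypothetical sample: two data cells from the question with a
    // multi-line label cell between them.
    static final String SAMPLE =
          "<td class=\"tbdy\">ACT</td>\n"
        + "<td class=\"ra\">\n"
        + "                TCKL\n"
        + "            </td>\n"
        + "<td class=\"tbdy\">36</td>\n";

    static final String S_TAG = "<td class=\"tbdy\">";
    static final String E_TAG = "</td>";

    // Same shape as the question's pattern: cell, newline, filler, cell.
    static final String REGEX =
          S_TAG + "(.*?)" + E_TAG + "\n"   // 1 status
        + "(.*?)"                          // 2 filler: the multi-line label cell
        + S_TAG + "(.*?)" + E_TAG;         // 3 tackles

    /** Returns the tackles value, or null when the pattern does not match. */
    static String tackles(String html, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(html);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        System.out.println(tackles(SAMPLE, 0));              // prints null
        System.out.println(tackles(SAMPLE, Pattern.DOTALL)); // prints 36
    }
}
```

With the flag set, the lazy filler group stretches across the newlines inside the label cell, so the original whitespace can stay in the input.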
