I'm trying to extract a specific <a> tag, the one that contains a specific date, from a block of HTML.

Here is the unit test in question, which includes the HTML supplied to the method:

public void testParseBasePage(){
    defenseGovContractsParser a = new defenseGovContractsParser("060613");
    String expected = "http://www.defense.gov/contracts/contract.aspx?contractid=5059";
    String result = a.parseBasePage("<td><a id=\"ctl00_ContentPlaceHolder_Body_ContractSummary_dgPRItems_ctl02_lnkPressItem\" title=\"Click for Contracts for June 06, 2013\" class=\"Link12\" href=\"http://www.defense.gov/contracts/contract.aspx?contractid=5059\">Contracts for June 06, 2013</a><span id=\"ctl00_ContentPlaceHolder_Body_ContractSummary_dgPRItems_ctl02_lblSubTitle\" class=\"MoreNews3a\"></span></td>");
    assertEquals(expected,result);
}

Here's the code in question:

public String parseBasePage(String HTML) {
    String contractUrl;
    String yr = date.substring(4, 6);
    String day = date.substring(2, 4);
    String month = getMonthForInt(Integer.parseInt(date.substring(0, 2)));
    Pattern getLink = Pattern.compile("<a.*?" + month + ".*?" + day + ".*?20" + yr + ".*?>");
    Matcher match = getLink.matcher(HTML);
    String link = match.group();
    contractUrl = link.substring(link.indexOf("href") + 6);
    contractUrl = contractUrl.replaceFirst("\">", "");
    return contractUrl;
}

private String getMonthForInt(int m) {
    String month = "invalid";
    m = m - 1;
    DateFormatSymbols dfs = new DateFormatSymbols();
    String[] months = dfs.getMonths();
    if (m >= 0 && m <= 11) {
        month = months[m];
    }
    return month;
}

The resulting regex is:

<a.*?June.*?06.*?2013.*?>

which matches as expected in every online regex tester I have tried.

NolanPower
  • Have you seen [this](http://jsoup.org/) and/or [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?answertab=oldest#tab-top)? – Reimeus Jun 07 '13 at 16:03
  • Getting to read that monologue was worth this question without an introduction to jsoup. I'll use jsoup and not consume all living tissue in the world. – NolanPower Jun 07 '13 at 16:33

1 Answer

I would really recommend a decent HTML parser such as JSoup or JTidy (perhaps confusingly named in this scenario), rather than using regexps for this purpose.

For all but the simplest cases regexps will not work on HTML, and a decent HTML parser is going to be a much more reliable solution.

Brian Agnew
  • For anybody who sees this: the actual mistake that kept this code from working is that I never invoked match.find() before calling match.group(). – NolanPower Jun 07 '13 at 17:25
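
For reference, here is a minimal sketch of what that fix might look like. The class name ContractLinkFinder, the static method signature (taking the date as a parameter instead of a field), and the use of Locale.ENGLISH for month names are assumptions made for illustration, not part of the original code. The key point is that Matcher.group() throws IllegalStateException unless a match operation such as find() has succeeded first.

```java
import java.text.DateFormatSymbols;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone version of the parser, for illustration only.
class ContractLinkFinder {

    // date is expected in MMddyy form, e.g. "060613" for June 06, 2013.
    static String parseBasePage(String html, String date) {
        String yr = date.substring(4, 6);
        String day = date.substring(2, 4);
        int monthIndex = Integer.parseInt(date.substring(0, 2)) - 1;
        // Assumption: English month names, so the result does not depend on the default locale.
        String month = new DateFormatSymbols(Locale.ENGLISH).getMonths()[monthIndex];

        Pattern getLink = Pattern.compile("<a.*?" + month + ".*?" + day + ".*?20" + yr + ".*?>");
        Matcher match = getLink.matcher(html);

        // The missing step: find() must succeed before group() may be called.
        if (!match.find()) {
            return null; // no matching link in this HTML
        }
        String link = match.group();

        // Take the text between href=" and the next double quote.
        int start = link.indexOf("href=\"") + 6;
        int end = link.indexOf('"', start);
        return link.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<td><a id=\"x\" title=\"Click for Contracts for June 06, 2013\" class=\"Link12\" "
                + "href=\"http://www.defense.gov/contracts/contract.aspx?contractid=5059\">"
                + "Contracts for June 06, 2013</a></td>";
        System.out.println(parseBasePage(html, "060613"));
    }
}
```

Returning null when nothing matches is just one option; throwing an exception or returning an Optional would work equally well. The advice in the answer above still applies: this only holds up while the markup stays as simple as in the test.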