I am building an Android application that fetches new announcements from my university's website.

This is the relevant HTML from the website:

<table border="1" width="90%" class="duyuru">
<tbody>
<tr>
<td>
<h3 class="duyuru">Additional Quotas for the Technical Electives</h3>
"19/09/2012"
<h4 class="duyuru">"Additional Quotas for Technical Electives offered in...</h4>
<span class="duyuru"></span>
<br>
<a href="news_image/96.doc">Download</a>
</td>
</tr>
</tbody>
</table>

I can extract the contents of the h3 and h4 elements ("Additional Quotas for the Technical Electives" and "Additional Quotas for ...") with the code below. However, I cannot extract the date (19/09/2012), which sits between the h3 and h4 elements.

String patternStr = "\\<h3 class=\"duyuru\".*?\\>(.*?)\\</h3\\>";
patternStr += "(.*?)";     // This line is problematic
patternStr += ".*?\\<h4 class=\"duyuru\".*?\\>(.*?)\\</h4\\>";

Pattern pattern = Pattern.compile(patternStr, Pattern.DOTALL);
Matcher matcher = pattern.matcher(content);

String name = "";
String date = "";
String details = "";

while (matcher.find()) {

    name    = matcher.group(1);
    date    = matcher.group(2);
    details = matcher.group(3);

    Announcement announcement = new Announcement();

    announcement.setName(name);
    announcement.setDate(date);
    announcement.setDetails(details);

    announcements.add(announcement);
}

I tried using

.*?\"(.*?)\"

but it didn't work. With this, the second group captures the string "duyuru" from the class attribute of the h4 tag instead of the date.

Does anyone have an idea how I can grab the date?

Thanks in advance.

Ercument Kisa
  • Always keep this in mind: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. But I'll have a look at it anyway. Btw: it typically helps if you include the HTML in a format that we can copy/paste easily too. – Frank van Puffelen Dec 08 '12 at 13:37
  • It seems you're simply not matching the newlines: http://fiddle.re/t9m5 – Frank van Puffelen Dec 08 '12 at 13:44

1 Answer

Your regular expression misses the newlines and whitespace in the input.

The simplest possible match I could come up with is:

"\\<h3 class=\"duyuru\".*?\\>\\n?\\s*(.*?)\\n?\\s*\\</h3\\>"

But keep in mind that such a regular expression is highly specific to your HTML.
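
To capture the date as well, the text between </h3> and <h4> needs a group that is forced to consume it. In the question's pattern the lazy (.*?) is immediately followed by another .*?, so the group is satisfied with the empty string and the second .*? absorbs the date. The following is a minimal sketch, not a definitive solution. It assumes that the date is the only text between the two headings and that it may or may not be wrapped in double quotes. The class name AnnouncementRegexSketch and the extract helper are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
final class AnnouncementRegexSketch {

    // Group 1: h3 text. Group 2: text between </h3> and <h4>. Group 3: h4 text.
    // Assumption: group 2 holds only the date, optionally wrapped in double quotes.
    private static final Pattern ANNOUNCEMENT = Pattern.compile(
            "<h3 class=\"duyuru\"[^>]*>\\s*(.*?)\\s*</h3>"
          + "\\s*\"?([^<\"]*?)\"?\\s*"   // cannot cross a tag or a quote
          + "<h4 class=\"duyuru\"[^>]*>\\s*(.*?)\\s*</h4>",
            Pattern.DOTALL);

    // Returns one {name, date, details} triple per announcement found.
    static List<String[]> extract(String content) {
        List<String[]> rows = new ArrayList<>();
        Matcher matcher = ANNOUNCEMENT.matcher(content);
        while (matcher.find()) {
            rows.add(new String[] {
                matcher.group(1), matcher.group(2), matcher.group(3)
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample input copied from the HTML in the question.
        String content =
                "<td>\n"
              + "<h3 class=\"duyuru\">Additional Quotas for the Technical Electives</h3>\n"
              + "\"19/09/2012\"\n"
              + "<h4 class=\"duyuru\">\"Additional Quotas for Technical Electives offered in...</h4>\n"
              + "</td>";
        for (String[] row : extract(content)) {
            System.out.println(row[0] + " | " + row[1] + " | " + row[2]);
        }
    }
}
```

Each returned triple can be fed into setName, setDate and setDetails in the loop from the question. Because the middle group excludes < and ", it cannot run into the h4 tag or pick up "duyuru" from its class attribute.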

My advice would be to have a look at a real HTML parser for Java, such as TagSoup. Once you start using one of those, parsing this type of HTML document becomes a breeze.

Frank van Puffelen