I've got the following regular expression to extract links from an HTML document, using java.util.regex:

 <a\s.*?href=([^ >]+).*?<img\s.*?src=([^ ]+)(.*?>.*?<\/a>)

and I expect it to match only the last link in this markup:

<font size="4">Mail : </font><a href="mailto:c.bantz@pgt-gmbh.com"><u><font size="4" color="#0000ff">s.weber@pgt-gmbh.com</font></u></a><br />
<br />
<font size="4">Internet : </font><a href="http://www.pgt-gmbh.com/"><u><font size="4" color="#0000ff">http://www.pgt-gmbh.com</font></u></a><font size="4"> </font><br />
<br />
<br />
<font size="4"> </font><a class="domino-attachment-link" style="display: inline-block; text-align: center" href="/_dv/_dv/documents_DE.nsf/0/7fadd8be280a2e34c1257dfd00307098/$FILE/Anfrage.pdf" title="Anfrage.pdf"><img src="/_dv/_dv/documents_DE.nsf/0/7fadd8be280a2e34c1257dfd00307098/f_Text/0.5F66?OpenElement&FieldElemFormat=gif" width="32" height="32" alt="Anfrage.pdf" border="0" /> - Anfrage.pdf</a>

However, it doesn't match just that link. It behaves like a greedy search instead: the match starts at the mailto: link and runs to the end of the last link. The same expression works fine in the regex tester at http://regex101.com.

Any hints?

Mr. Polywhirl

1 Answer

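The problem does not occur if each line of the markup ends with a newline.

Here is the explanation: the <a href="mailto:... tag is matched by the first part of the regular expression, <a\s.*?href=([^ >]+). The following .*? then matches any sequence of characters other than line breaks until it finds <img. Without line breaks in the input, nothing stops it from running past the first two links up to the <img of the last one. A lazy quantifier still expands as far as it has to; it only prefers the shortest expansion from the position where the match started.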

Example (one with and one without newlines):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; it only makes the snippet compilable on its own.
public class LinkMatchDemo {

// The pattern from the question, escaped for a Java string literal.
private static final Pattern P = Pattern.compile("<a\\s.*?href=([^ >]+).*?<img\\s.*?src=([^ ]+)(.*?>.*?<\\/a>)");
// The markup as one long line, without any line breaks.
private static final String TEXT = "<font size=\"4\">Mail              : </font><a href=\"mailto:c.bantz@pgt-gmbh.com\"><u><font size=\"4\" color=\"#0000ff\">s.weber@pgt-gmbh.com</font></u></a><br />"
    + "<br />"
    + "<font size=\"4\">Internet        : </font><a href=\"http://www.pgt-gmbh.com/\"><u><font size=\"4\" color=\"#0000ff\">http://www.pgt-gmbh.com</font></u></a><font size=\"4\"> </font><br />"
    + "<br />"
    + "<br />"
    + "<font size=\"4\"> </font><a class=\"domino-attachment-link\" style=\"display: inline-block; text-align: center\" href=\"/_dv/_dv/documents_DE.nsf/0/7fadd8be280a2e34c1257dfd00307098/$FILE/Anfrage.pdf\" title=\"Anfrage.pdf\"><img src=\"/_dv/_dv/documents_DE.nsf/0/7fadd8be280a2e34c1257dfd00307098/f_Text/0.5F66?OpenElement&FieldElemFormat=gif\" width=\"32\" height=\"32\" alt=\"Anfrage.pdf\" border=\"0\" /> - Anfrage.pdf</a>";
// The same markup with a newline at the end of every line except the last.
private static final String NEWLINE_TEXT = "<font size=\"4\">Mail              : </font><a href=\"mailto:c.bantz@pgt-gmbh.com\"><u><font size=\"4\" color=\"#0000ff\">s.weber@pgt-gmbh.com</font></u></a><br />\n"
    + "<br />\n"
    + "<font size=\"4\">Internet        : </font><a href=\"http://www.pgt-gmbh.com/\"><u><font size=\"4\" color=\"#0000ff\">http://www.pgt-gmbh.com</font></u></a><font size=\"4\"> </font><br />\n"
    + "<br />\n"
    + "<br />\n"
    + "<font size=\"4\"> </font><a class=\"domino-attachment-link\" style=\"display: inline-block; text-align: center\" href=\"/_dv/_dv/documents_DE.nsf/0/7fadd8be280a2e34c1257dfd00307098/$FILE/Anfrage.pdf\" title=\"Anfrage.pdf\"><img src=\"/_dv/_dv/documents_DE.nsf/0/7fadd8be280a2e34c1257dfd00307098/f_Text/0.5F66?OpenElement&FieldElemFormat=gif\" width=\"32\" height=\"32\" alt=\"Anfrage.pdf\" border=\"0\" /> - Anfrage.pdf</a>";

public static void main(String[] args) {
    // Without newlines: the match starts at the mailto: link.
    Matcher m = P.matcher(TEXT);
    if (m.find()) {
        System.out.println(m.group());
    }
    // With newlines: the dot cannot cross a line break, so only the last link matches.
    m = P.matcher(NEWLINE_TEXT);
    if (m.find()) {
        System.out.println(m.group());
    }
}

}

Output:

<a href="mailto:c.bantz@pgt-gmbh.com">... without newlines

<a class="domino-attachment-link"... with newlines
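The mechanism is easier to see on a shortened input. The following is only a minimal sketch: the class name `LazyDotDemo`, the helper `firstHref` and the abbreviated markup are made up for illustration, while the pattern is the one from the question. It also tries `Pattern.DOTALL`, which lets the dot match line terminators, so the overlong match should come back even when the input contains newlines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyDotDemo {
    // The pattern from the question, unchanged.
    public static final String REGEX =
            "<a\\s.*?href=([^ >]+).*?<img\\s.*?src=([^ ]+)(.*?>.*?<\\/a>)";

    // Shortened, hypothetical markup: a plain link followed by an image link.
    public static final String ONE_LINE =
            "<a href=\"mailto:x@example.com\">mail</a><br />"
            + "<a href=\"/doc.pdf\"><img src=\"/icon.gif\" /> - doc.pdf</a>";
    public static final String TWO_LINES =
            "<a href=\"mailto:x@example.com\">mail</a><br />\n"
            + "<a href=\"/doc.pdf\"><img src=\"/icon.gif\" /> - doc.pdf</a>";

    // Returns what group 1 captured in the first match, or null if nothing matched.
    public static String firstHref(String text, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // No line break: the match starts at the first <a and runs to the <img.
        System.out.println(firstHref(ONE_LINE, 0));               // "mailto:x@example.com"
        // Line break: the dot stops at the newline, so only the image link matches.
        System.out.println(firstHref(TWO_LINES, 0));              // "/doc.pdf"
        // DOTALL: the dot crosses the newline again.
        System.out.println(firstHref(TWO_LINES, Pattern.DOTALL)); // "mailto:x@example.com"
    }
}
```

Note that group 1 includes the surrounding quotes, because `[^ >]+` does not exclude them.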

A better pattern:

<a\s[^>]*?href=([^>]+)><img\s.*?src=([^ ]+)(.*?>.*?<\/a>)

The problem with HTML and regex is that the pattern above matches only one specific situation: if any markup sits between <a ...> and <img ...>, it fails. This could certainly be fixed, but the expression would become more and more incomprehensible.
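As a rough check of both claims, the sketch below runs the pattern above against shortened, made-up markup (the class name `BetterPatternDemo`, the helper `firstHref` and the inputs are assumptions for illustration). It should skip the plain link even without newlines, and it should find nothing once an extra tag sits between the `<a ...>` and the `<img ...>`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetterPatternDemo {
    // The stricter pattern from the answer: [^>] keeps the href part inside one tag.
    public static final Pattern P = Pattern.compile(
            "<a\\s[^>]*?href=([^>]+)><img\\s.*?src=([^ ]+)(.*?>.*?<\\/a>)");

    // Hypothetical inputs, all on one line.
    public static final String ONE_LINE =
            "<a href=\"mailto:x@example.com\">mail</a><br />"
            + "<a href=\"/doc.pdf\"><img src=\"/icon.gif\" /> - doc.pdf</a>";
    public static final String WITH_TITLE =
            "<a href=\"/doc.pdf\" title=\"t\"><img src=\"/icon.gif\" /></a>";
    public static final String NESTED =
            "<a href=\"/doc.pdf\"><b><img src=\"/icon.gif\" /></b></a>";

    // Returns what group 1 captured in the first match, or null if nothing matched.
    public static String firstHref(String text) {
        Matcher m = P.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstHref(ONE_LINE));   // "/doc.pdf"
        System.out.println(firstHref(WITH_TITLE)); // "/doc.pdf" title="t"
        System.out.println(firstHref(NESTED));     // null
    }
}
```

The second call shows a side effect worth knowing about: because group 1 is `[^>]+`, it captures everything up to the end of the tag, including any attributes that follow the href.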

So: if you need this kind of extraction for more than one link, you should switch to an HTML parser (although finding the best one is a science of its own).
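To give an idea of what the parser route can look like, here is a minimal sketch that uses the HTML parser bundled with the JDK in `javax.swing.text.html`. That choice is an assumption made only to avoid an extra dependency; it is an old, HTML 3.2 era parser, and a dedicated library may well be the better pick. The class name `ImageLinkExtractor`, the method `imageLinks` and the sample markup are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ImageLinkExtractor {
    // Collects "href -> src" for every <img> that appears inside an <a href=...>.
    public static List<String> imageLinks(String html) throws IOException {
        List<String> result = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private String currentHref; // non-null while inside an <a> that has an href

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    currentHref = (href == null) ? null : href.toString();
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A) {
                    currentHref = null;
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <img> has no content, so the parser reports it as a simple tag.
                if (tag == HTML.Tag.IMG && currentHref != null) {
                    result.add(currentHref + " -> " + attrs.getAttribute(HTML.Attribute.SRC));
                }
            }
        };
        // The third argument tells the parser to ignore charset declarations in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical markup: one plain link, one image link nested in extra tags.
        String html = "<a href=\"mailto:x@example.com\"><u>mail</u></a><br />"
                + "<a class=\"c\" href=\"/doc.pdf\" title=\"doc.pdf\"><b><img src=\"/icon.gif\" /></b> - doc.pdf</a>";
        for (String link : imageLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Unlike the regular expressions, this does not care about attribute order, line breaks, or markup between the `<a ...>` and the `<img ...>`, and the attribute values come back without their quotes.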

CoronA
  • Thank you, that makes it clear. I've just found out that there is `Pattern.DOTALL`, which makes the dot match line breaks as well. Do you also have a tip on how I can change the pattern so that it only includes links with an ` – terribleherbst Mar 06 '15 at 12:29
  • I just edited my answer and added a working pattern. But extracting HTML with regular expressions is really treacherous: a pattern might work in one case and fail in another special case. Therefore an HTML parser is the best choice. – CoronA Mar 06 '15 at 12:38