I have written a program to find all the links in an HTML page:

public static void main(String[] args) throws IOException {
    String base = "http://www.oracle.com/";
    URL url = new URL(base);
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

    StringBuffer stringBuffer = new StringBuffer();
    String inputLine = null;
    while ((inputLine = in.readLine()) != null) {
        stringBuffer = stringBuffer.append(inputLine).append("\n");
    }

    Matcher matcher = Pattern.compile("<a .*href=\"([^\"]+)\".*</a>", Pattern.DOTALL).matcher(stringBuffer.toString());

    ArrayList<String> urlList = new ArrayList<>();
    while (matcher.find()){
        String relUrl = matcher.group(1);
        String fullUrl = relUrl.startsWith("/")?base+relUrl.substring(1):relUrl;
        urlList.add(fullUrl);
        System.out.println(fullUrl);
    }

    in.close();
}

For some reason, when I run this code it is only matching one link. However, when I run it without the DOTALL flag, it matches 108 links. The reason I included the DOTALL flag is to match links where the a tag may go over one line, such as:

    <li><a data-lbl="solutions" href="https://www.oracle.com/solutions/index.html#menu-solutions" data-trackas="hnav" class="u01nav">
<h3>Solutions</h3>
</a></li>

According to here, the regex <a .*href=\"([^\"]+)\".*<\/a> matches the HTML above. (This is slightly different from the one I used in the code, because Eclipse wouldn't let me escape the / character.)

b_pcakes

1 Answer

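Because .* is greedy, it matches as many characters as possible, and with DOTALL that includes newlines, so a single match swallows everything up to the last href on the page. Make it non-greedy with .*?: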

"<a .*?href=\"([^\"]+)\".*?</a>"

or

"<a [^<>]*\\bhref=\"([^\"]+)\".*?</a>"
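As a rough sketch of how the first pattern could be used, here it is wrapped in a small helper and run against the multi-line anchor from the question. The class name `LinkExtractor`, the method `extractLinks`, and the second sample link (`/downloads/`) are made up for illustration; regex-based HTML scraping like this is also only an approximation of what a real HTML parser would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Non-greedy pattern; DOTALL lets '.' cross line breaks inside one anchor.
    private static final Pattern LINK = Pattern.compile(
            "<a .*?href=\"([^\"]+)\".*?</a>", Pattern.DOTALL);

    /** Returns the href value of every anchor the pattern finds, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // First anchor is the multi-line sample from the question;
        // the second one is a hypothetical single-line link.
        String html =
                "<li><a data-lbl=\"solutions\" href=\"https://www.oracle.com/solutions/index.html#menu-solutions\" data-trackas=\"hnav\" class=\"u01nav\">\n"
              + "<h3>Solutions</h3>\n"
              + "</a></li>\n"
              + "<li><a href=\"/downloads/\">Downloads</a></li>\n";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Each `.*?` stops at the first place the rest of the pattern can match, so every anchor produces its own match instead of one match spanning the whole input.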
Avinash Raj
  • It works, thank you. Just out of curiosity, why is it that without the `DOTALL` flag, the pattern is able to match 100 or so links? I thought the `DOTALL` flag didn't affect the greediness of the pattern? – b_pcakes Nov 11 '15 at 02:58
  • Because about 100 of the anchor tags each fit on a single line. – Avinash Raj Nov 11 '15 at 03:02
  • Yes, but why for those 100 tags, did the `.*` not match all the characters? – b_pcakes Nov 11 '15 at 03:07
  • Maybe because a \n character exists in between. I'm still not clear about what you mean. Post some example data and the regex you tried on regex101, then give me the link back. – Avinash Raj Nov 11 '15 at 03:09
  • @SimonZhu `DOTALL` does not affect greediness; what it does is allow the dot in `.*` to match newlines. Thus, `.*` will consume the whole page up to the last occurrence of `href=...` – Mariano Nov 11 '15 at 03:14
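The interaction between greediness and `DOTALL` described in these comments can be reproduced on a tiny input. This is only an illustrative sketch: the class name `GreedyDotallDemo`, the helper `hrefs`, and the three-line sample are invented here, not taken from the Oracle page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDotallDemo {
    // Hypothetical sample: three anchors, one per line.
    static final String SAMPLE =
            "<a href=\"/a\">A</a>\n<a href=\"/b\">B</a>\n<a href=\"/c\">C</a>\n";
    static final String GREEDY = "<a .*href=\"([^\"]+)\".*</a>";
    static final String LAZY = "<a .*?href=\"([^\"]+)\".*?</a>";

    /** Collects group(1) of every match of regex (compiled with flags) in input. */
    static List<String> hrefs(String regex, int flags, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Without DOTALL, '.' stops at each newline, so greedy .* is confined to one line.
        System.out.println(hrefs(GREEDY, 0, SAMPLE));              // [/a, /b, /c]
        // With DOTALL, greedy .* runs to the end and backtracks to the last href.
        System.out.println(hrefs(GREEDY, Pattern.DOTALL, SAMPLE)); // [/c]
        // Non-greedy quantifiers give one match per anchor even with DOTALL.
        System.out.println(hrefs(LAZY, Pattern.DOTALL, SAMPLE));   // [/a, /b, /c]
    }
}
```

The middle case mirrors the single match seen in the question: the one match starts at the first `<a ` and captures only the last `href` in the input.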