I have written a program to find all the links in an HTML page:
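    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The class name is illustrative; only main() matters here.
    public class LinkFinder {
        public static void main(String[] args) throws IOException {
            String base = "http://www.oracle.com/";
            URL url = new URL(base);
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            // Read the whole page into one string, keeping the line breaks.
            StringBuffer stringBuffer = new StringBuffer();
            String inputLine = null;
            while ((inputLine = in.readLine()) != null) {
                stringBuffer.append(inputLine).append("\n");
            }
            Matcher matcher = Pattern.compile("<a .*href=\"([^\"]+)\".*</a>", Pattern.DOTALL).matcher(stringBuffer.toString());
            ArrayList<String> urlList = new ArrayList<>();
            while (matcher.find()) {
                String relUrl = matcher.group(1);
                // Turn root-relative links into absolute ones.
                String fullUrl = relUrl.startsWith("/") ? base + relUrl.substring(1) : relUrl;
                urlList.add(fullUrl);
                System.out.println(fullUrl);
            }
            in.close();
        }
    }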
For some reason, when I run this code it matches only one link. However, when I run it without the DOTALL flag, it matches 108 links. I included the DOTALL flag so that it would match links where the <a> tag spans more than one line, such as:
    <li><a data-lbl="solutions" href="https://www.oracle.com/solutions/index.html#menu-solutions" data-trackas="hnav" class="u01nav">
        <h3>Solutions</h3>
    </a></li>
According to here, the regex <a .*href=\"([^\"]+)\".*<\/a> matches the HTML above. (This is slightly different from the one I used in the code because Eclipse wouldn't let me escape the / character.)
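One possible explanation for the behavior described above is that `.*` is greedy: with DOTALL it can also cross line breaks, so a single match may stretch from the first `<a ` to the last `</a>` on the page. The sketch below reproduces that on a small hand-written HTML string rather than the real Oracle page. The class `LinkRegexDemo`, the helper `hrefs`, and the `RELUCTANT` pattern are hypothetical names and choices made for illustration; the alternative pattern is one option among several, not a robust HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkRegexDemo {
    // The pattern from the question: greedy .* combined with DOTALL.
    static final Pattern GREEDY =
            Pattern.compile("<a .*href=\"([^\"]+)\".*</a>", Pattern.DOTALL);

    // Hypothetical alternative: [^>]* keeps the attribute scan inside one tag,
    // and the reluctant .*? stops at the nearest </a> instead of the last one.
    static final Pattern RELUCTANT =
            Pattern.compile("<a\\s[^>]*href=\"([^\"]+)\"[^>]*>.*?</a>", Pattern.DOTALL);

    // Collects capture group 1 (the href value) of every match.
    static List<String> hrefs(Pattern pattern, String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two links; the first one spans several lines.
        String html = "<li><a href=\"/a.html\">\n<h3>A</h3>\n</a></li>\n"
                    + "<li><a class=\"x\" href=\"/b.html\">B</a></li>\n";

        System.out.println(hrefs(GREEDY, html));    // prints [/b.html]
        System.out.println(hrefs(RELUCTANT, html)); // prints [/a.html, /b.html]
    }
}
```

With the greedy pattern the whole sample is consumed by one match, and group 1 ends up holding the last `href` because the first `.*` backtracks only as far as it has to. Without DOTALL, `.` stops at line breaks, so each match is confined to one line, which would explain why that version finds many links but misses the multi-line ones.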