I am trying to extract all links from an HTML file using Java.

The pattern seems to be `<a href = "Name">`. I would like to obtain the URL so that I can access the linked webpage.

Can you guys help me out with an approach (`String.contains`? `String.indexOf`?)?

Thank you.

  • Use a parser like [jsoup](http://jsoup.org/). That way you can just call `document.select("a")` and get all links. Also see http://jsoup.org/cookbook/extracting-data/selector-syntax for more info about the selector syntax, which lets you specify what may appear in the `href` attribute. – Pshemo Jan 10 '15 at 03:08
  • Possible duplicate of [Extract links from a web page](http://stackoverflow.com/questions/5120171/extract-links-from-a-web-page) – Joe Jan 10 '15 at 12:19

1 Answer

A basic approach, using just the language fundamentals, would be regex matching.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String html = "YOUR HTML";
    // Matches tags of the form <a href = "...">, allowing an optional space around '='
    // and capturing everything between the quotes as group 1.
    String regex = "<a href\\s?=\\s?\"([^\"]+)\">";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(html);
    while (matcher.find()) {
        String wholeThing = matcher.group(); // the full match, including "<a href" and ">"
        String link = matcher.group(1);      // just the link
        // do something with wholeThing or link
    }

On the other hand, you could use a proper HTML parser and work with a `Document`, as suggested in the comments; I don't know much about that approach myself.
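
If you go the parser route, a minimal sketch based on the jsoup calls mentioned in the comments might look like this (assuming the jsoup library is on the classpath; `"YOUR HTML"` is just a placeholder for your input):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    String html = "YOUR HTML";
    Document doc = Jsoup.parse(html);
    // "a[href]" selects every <a> element that carries an href attribute
    for (Element a : doc.select("a[href]")) {
        String link = a.attr("href"); // the raw value of the href attribute
        // do something with link
    }

Unlike the regex above, a parser is not thrown off by extra whitespace, attribute order, or single-quoted attribute values.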
