I am trying to extract all links from an HTML file using Java.

The pattern seems to be `<a href = "Name">`. I would like to obtain the URL so that I can access the linked webpage.

Can you guys help me out with an approach (`String.contains`? `String.indexOf`?)?

Thank you.

  • Use a parser like [jsoup](http://jsoup.org/). That way you can just call `document.select("a")` and get all links. Also see http://jsoup.org/cookbook/extracting-data/selector-syntax for more info about the selector syntax, which lets you specify what may appear in the `href` attribute. – Pshemo Jan 10 '15 at 03:08
  • Possible duplicate of [Extract links from a web page](http://stackoverflow.com/questions/5120171/extract-links-from-a-web-page) – Joe Jan 10 '15 at 12:19

1 Answer

A basic approach, using just the language fundamentals, would be regex matching.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String html = "YOUR HTML";
    // Matches tags of the form <a href = "...">, allowing an optional space around '='
    // and capturing everything between the quotes as group 1.
    String regex = "<a href\\s?=\\s?\"([^\"]+)\">";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(html);
    while (matcher.find()) {
        String wholeThing = matcher.group(); // the full match, including "<a href" and ">"
        String link = matcher.group(1);      // just the link
        // do something with wholeThing or link
    }

On the other hand, you could use a proper HTML parser and work with a `Document`, as suggested in the comments; I don't know much about that approach myself.
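
If you go the parser route, a minimal sketch based on the jsoup calls mentioned in the comments might look like this (assuming the jsoup library is on the classpath; `"YOUR HTML"` is just a placeholder for your input):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    String html = "YOUR HTML";
    Document doc = Jsoup.parse(html);
    // "a[href]" selects every <a> element that carries an href attribute
    for (Element a : doc.select("a[href]")) {
        String link = a.attr("href"); // the raw value of the href attribute
        // do something with link
    }

Unlike the regex above, a parser is not thrown off by extra whitespace, attribute order, or single-quoted attribute values.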
