My goal is to always get the same string (the full URI, in my case) when reading the href attribute of a link. For example, suppose an HTML file contains many links like these:
<a href="index.html"> where the base URL is http://www.domainname.com/index.html
<a href="../index.html"> where the base URL is http://www.domainname.com/dit/index.html
How can I get every link as a full URL, including the domain name?
How can I do that in Java?
The input is HTML; that is, I need to extract the correct links from a block of HTML code.

Alex Mathew

2 Answers

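You can do this using a full-fledged HTML parser like Jsoup. Its Node#absUrl() method does exactly what you want: it takes an attribute name such as href and returns the attribute's value as an absolute URL, resolved against the document's base URI.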

package com.stackoverflow.q3394298;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Test {

    public static void main(String... args) throws Exception {
        // Jsoup.connect() expects the URL as a String, not a java.net.URL.
        String url = "https://stackoverflow.com/questions/3394298/";
        Document document = Jsoup.connect(url).get();
        Element link = document.select("a.question-hyperlink").first();
        System.out.println(link.attr("href"));   // the value as written in the HTML
        System.out.println(link.absUrl("href")); // the same value as an absolute URL
    }

}

This prints (correctly) the following for the title link of this question:

/questions/3394298/full-link-extraction-using-java
https://stackoverflow.com/questions/3394298/full-link-extraction-using-java

Jsoup may well have other advantages for your purpose that you have not discovered yet.

Update: if you want to select all links in the document, do the following (this needs an additional import of org.jsoup.select.Elements):

        Elements links = document.select("a");
        for (Element link : links) {
            System.out.println(link.attr("href"));
            System.out.println(link.absUrl("href"));
        }

BalusC

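Use the java.net.URL constructor that takes a context (base) URL and a spec, and resolves the spec against the context: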

URL url = new URL(URL context, String spec)

Here's an example:

import java.net.URL;

public class Test {
    public static void main(String[] args) throws Exception {
        // The page URL serves as the context for resolving the relative link.
        URL base = new URL("http://www.java.com/dit/index.html");
        URL url = new URL(base, "../hello.html");

        System.out.println(base);
        System.out.println(url);
    }
}

It will print:

http://www.java.com/dit/index.html
http://www.java.com/hello.html
naikus
  • It will not extract links from a URL, because in my case the input is HTML; that is, I need to extract the correct links from a block of HTML code. – Alex Mathew Aug 03 '10 at 19:12
  • @Alex Mathew, untrue. You can combine the answer to http://stackoverflow.com/questions/3383152/how-to-find-hyperlink-in-a-webpage-using-java with this answer to do exactly what you intend to do. Please put some effort into it. My guess is that both of these questions are about the same problem. – naikus Aug 04 '10 at 03:45
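
The combination suggested in the last comment (find the href values in the HTML, then resolve each one against the page URL) can be sketched with the standard library alone. This is a minimal, hypothetical example and not code from either answer: the class name AbsoluteLinks and the method extractAbsoluteLinks are made up for illustration, java.net.URI#resolve is used in place of the URL constructor, and the regular expression is a deliberate simplification that only handles quoted href values. A real HTML parser such as Jsoup is more robust, and URI.create will throw on href values that contain illegal characters such as spaces.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AbsoluteLinks {

    // Simplified pattern (an assumption for this sketch): an <a ...> tag with a
    // single- or double-quoted href value. Group 2 captures the value itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every href found in the HTML, resolved against the page URL.
    public static List<String> extractAbsoluteLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            String href = matcher.group(2).trim();
            links.add(base.resolve(href).toString());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"index.html\">Home</a> <a href=\"../index.html\">Up</a>";
        String pageUrl = "http://www.domainname.com/dit/index.html";
        // Prints: [http://www.domainname.com/dit/index.html, http://www.domainname.com/index.html]
        System.out.println(extractAbsoluteLinks(html, pageUrl));
    }
}
```

Links that are already absolute pass through resolve unchanged, so the method can be applied to every href without checking first.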