Full Link Extraction using java

Question

My goal is to always get the same string (the full URI) when reading the href attribute of a link. For example, suppose an HTML file contains many links like these:
<a href="index.html"> where the page is at http://www.domainname.com/index.html
<a href="../index.html"> where the page is at http://www.domainname.com/dit/index.html
How can I get every link correctly, meaning the full link including the domain name?
How can I do that in Java?
The input is HTML; that is, the correct links need to be extracted from a block of HTML code.

Do you have access to the Request? – Sylar Aug 03 '10 at 07:17

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12

You can do this using a full-fledged HTML parser like Jsoup. Its Node#absUrl() method does exactly what you want.

package com.stackoverflow.q3394298;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Test {

    public static void main(String... args) throws Exception {
        // Jsoup.connect() takes the URL as a String, not a java.net.URL.
        String url = "https://stackoverflow.com/questions/3394298/";
        Document document = Jsoup.connect(url).get();
        // first() returns null when nothing matches the selector.
        Element link = document.select("a.question-hyperlink").first();
        System.out.println(link.attr("href"));   // href exactly as written in the HTML
        System.out.println(link.absUrl("href")); // href resolved against the document's base URI
    }

}

This prints (correctly) the following for the title link of this question:

/questions/3394298/full-link-extraction-using-java
https://stackoverflow.com/questions/3394298/full-link-extraction-using-java

Jsoup may well have other (as yet undiscovered) advantages for your purpose.

Related questions:

What are the pros and cons of the leading HTML parsers in Java?

Update: if you want to select all links in the document, then do as follows:

        // Requires: import org.jsoup.select.Elements;
        Elements links = document.select("a");
        for (Element link : links) {
            System.out.println(link.attr("href"));
            System.out.println(link.absUrl("href"));
        }

answered Aug 03 '10 at 19:21

BalusC

Can you give me the full code? I am not very experienced in Java. Will you please help? – Alex Mathew Aug 03 '10 at 19:23
OK, I edited it into SSCCE form so that you can copy, paste and run it without changes (you just need to put the Jsoup JAR file on the classpath). – BalusC Aug 03 '10 at 19:25
It shows an error: "Exception in thread "main" java.lang.NullPointerException at Test.main". I gave it the site http://www.yahoo.com and it shows the above error. – Alex Mathew Aug 03 '10 at 19:29
You need to select the link(s) of interest. The `"a.question-hyperlink"` selector basically selects all `<a>` elements which have `class="question-hyperlink"`. Right-click the page and view the source. Here at stackoverflow.com there is one such link. At yahoo.com there are none. You need to alter the selector to select exactly the element(s) of interest. If you want to select all links, then use `document.select("a")` and loop over the obtained links. This has been answered in your [other question](http://stackoverflow.com/questions/3386065). – BalusC Aug 03 '10 at 20:15
See my answer to this question: http://stackoverflow.com/questions/3383152/how-to-find-hyperlink-in-a-webpage-using-java. – naikus Aug 04 '10 at 03:53

score 3 · Answer 2 · answered Aug 03 '10 at 07:14

Use the two-argument constructor of the URL object, which resolves spec against a base (context) URL:

URL url = new URL(URL context, String spec)

Here's an example:

import java.net.URL;

public class Test {
    public static void main(String[] args) throws Exception {
        // The page the link was found on.
        URL base = new URL("http://www.java.com/dit/index.html");
        // Resolve the relative href against that page.
        URL url = new URL(base, "../hello.html");

        System.out.println(base);
        System.out.println(url);
    }
}

It will print:

http://www.java.com/dit/index.html
http://www.java.com/hello.html
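
The two examples from the question can be checked the same way. The following is a minimal sketch, not part of the original answer, that uses java.net.URI#resolve instead of the URL constructor (the URL constructors are deprecated as of Java 20). The class name ResolveDemo and the helper resolve are made up for illustration.

```java
import java.net.URI;

// Hypothetical helper class for illustration; not from the original answer.
class ResolveDemo {

    // Resolves an href value against the URL of the page it was found on.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // The two cases from the question.
        System.out.println(resolve("http://www.domainname.com/index.html", "index.html"));
        System.out.println(resolve("http://www.domainname.com/dit/index.html", "../index.html"));
    }
}
```

Both calls should yield http://www.domainname.com/index.html, since "../" steps out of the /dit/ directory.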

naikus

It will not extract links from a URL. For me the input is HTML; that is, the correct links need to be extracted from a block of HTML code. – Alex Mathew Aug 03 '10 at 19:12
@Alex Mathew, untrue: you can combine the answer to http://stackoverflow.com/questions/3383152/how-to-find-hyperlink-in-a-webpage-using-java with this answer to do exactly what you intend to do. Please put some effort into it. My guess is that both of these questions are about the same problem. – naikus Aug 04 '10 at 03:45
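
A rough sketch of the combination suggested in the last comment, using only the standard library: pull the href values out of an HTML string and resolve each one against the page URL. This is an illustration under assumptions, not code from either answer. The class HrefResolver, the method absoluteLinks and the regular expression are all hypothetical, and a regex is a deliberately simplistic stand-in for a real HTML parser (it only handles quoted href values inside <a> tags).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not from either answer.
class HrefResolver {

    // Matches href="..." or href='...' inside an <a ...> tag.
    // Simplistic on purpose: unquoted attribute values are not handled.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns every href found in the HTML, resolved against pageUrl.
    // URI#resolve throws IllegalArgumentException for hrefs that are not valid URIs.
    static List<String> absoluteLinks(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(base.resolve(m.group(2).trim()).toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"index.html\">Home</a> <a href=\"../index.html\">Up</a>";
        for (String link : absoluteLinks(html, "http://www.domainname.com/dit/index.html")) {
            System.out.println(link);
        }
    }
}
```

For anything beyond a quick experiment, a real parser such as Jsoup (see the accepted answer) is the safer choice, because regex matching breaks easily on real-world HTML.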
