Get links in a web site

Question

How can I get the links in a web page without loading it? Basically, what I want is this: a user enters a URL, and I want to load all the links available at that URL. Can you please tell me a way to achieve this?

What do you mean by "without loading it"? You'll have to at least fetch the contents of the URL and process them somehow. — NG., Oct 06 '10 at 15:47
@SB I think he means that he doesn't want to make a GET request to each of the hyperlinks. — jmj, Oct 06 '10 at 15:52
This is how it works: a user comes and enters a URL, and I get all the links inside that URL. Then I do some processing and show some results to the user. "Not loading" means the user should not see whether his URL is loading or not (it can load, but that should not be shown to the user). — netha, Oct 06 '10 at 15:57
@netha, first of all, are you working with Java or JavaScript? They aren't the same thing at all. — Colin Hebert, Oct 06 '10 at 15:59
I'd be happy to get the links using JavaScript, but if that is impossible I don't mind getting them with Java. — netha, Oct 06 '10 at 16:06
@netha For Java, see the code in my answer. JavaScript would be too heavy for complex parsing scenarios, since the work would be done in the client's browser, so I think the Java approach is the better one. — jmj, Oct 06 '10 at 16:30
I tried your code and it gave an exception: Exception in thread "main" java.net.SocketException: Network is unreachable — netha, Oct 06 '10 at 16:43

score 2 · Answer 1 · edited Jul 14 '14 at 16:11

Here is example Java code that fetches a page and prints each of its links as an HTML anchor:

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class Main {
  public static void main(String[] args) throws Exception {
    // The page to scan is passed as the first command-line argument.
    URL url = new URL(args[0]);
    // openStream() fetches the page; UTF-8 is assumed for decoding.
    try (Reader reader = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
      System.out.println("<HTML><HEAD><TITLE>Links for " + args[0] + "</TITLE>");
      // BASE HREF lets relative links in the output resolve against the original page.
      System.out.println("<BASE HREF=\"" + args[0] + "\"></HEAD>");
      System.out.println("<BODY>");
      // true = ignore charset declarations in the page, which would otherwise
      // abort parsing with a ChangedCharSetException.
      new ParserDelegator().parse(reader, new LinkPage(), true);
      System.out.println("</BODY></HTML>");
    }
  }
}

class LinkPage extends HTMLEditorKit.ParserCallback {

  // Called for every opening tag; only <A> tags that carry an HREF are printed.
  @Override
  public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
    Object href = a.getAttribute(HTML.Attribute.HREF);
    if (t == HTML.Tag.A && href != null) {
      System.out.println("<A HREF=\"" + href + "\">" + href + "</A><BR>");
    }
  }

}

@Netha, can you post the whole stack trace? — jmj, Oct 07 '10 at 18:25

score 0 · Answer 2 · edited Jul 14 '14 at 16:07

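import java.util.List;

// Prints every link found on the given page, one per line.
// extractLinks(String) is a helper that this answer does not show; it is
// expected to fetch the page and return the URLs of its links.
public void extract_link(String site) {
    try {
        List<String> links = extractLinks(site);
        for (String link : links) {
            System.out.println(link);
        }
    } catch (Exception e) {
        // Report the failure instead of letting it end the program.
        System.err.println(e);
    }
}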

This is a simple function to print all the links in a page. If you also want the links inside the linked pages, call it recursively, but make sure you set a depth limit that suits your needs.
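Since the answer does not show extractLinks, here is one way it might look, together with the depth-limited recursion. Treat it as a sketch: the class name LinkCrawler, the crawl method, and the fetchHtml parameter are all made up for illustration, and extractLinks here takes the HTML text instead of a site address so that the download step stays with the caller.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkCrawler {

  // Hypothetical stand-in for extractLinks: returns the href of every <a> tag in the HTML text.
  static List<String> extractLinks(String html) {
    List<String> links = new ArrayList<>();
    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
      @Override
      public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        Object href = attrs.getAttribute(HTML.Attribute.HREF);
        if (tag == HTML.Tag.A && href != null) {
          links.add(href.toString());
        }
      }
    };
    try {
      // true = ignore charset declarations inside the document
      new ParserDelegator().parse(new StringReader(html), callback, true);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return links;
  }

  // Collects links starting at url, following them at most maxDepth levels deep.
  // fetchHtml maps a URL to that page's HTML; it may return null for a page it cannot load.
  static Set<String> crawl(String url, int maxDepth, Function<String, String> fetchHtml) {
    Set<String> found = new LinkedHashSet<>();
    collect(url, maxDepth, fetchHtml, found);
    return found;
  }

  private static void collect(String url, int depth,
      Function<String, String> fetchHtml, Set<String> found) {
    if (depth <= 0) {
      return; // depth limit reached
    }
    String html = fetchHtml.apply(url);
    if (html == null) {
      return;
    }
    for (String link : extractLinks(html)) {
      // add() returns false for links already seen, which also prevents endless loops
      if (found.add(link)) {
        collect(link, depth - 1, fetchHtml, found);
      }
    }
  }
}
```

With maxDepth set to 1 this returns only the links on the start page; each extra level costs one more round of downloads, so keep the limit small. Relative links are returned as written and would need resolving against the page URL before they can be fetched.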

score 0 · Answer 3 · answered Oct 06 '10 at 15:47

You'll have to load the page on your server and then find the links, preferably by loading the document into an HTML/XML parser and traversing the resulting DOM. The server can then send the links back to the client.

You can't do it on the client because the browser's same-origin policy won't let your JavaScript code look at the contents of a page from a different domain.
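If the server side happens to be Java, the "load it into a parser and traverse the DOM" step might look like the sketch below. It uses the JDK's built-in XML parser, so it only works when the page is well-formed XHTML; ordinary HTML usually needs a more forgiving HTML parser. The class and method names (DomLinkFinder, findLinks) are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomLinkFinder {

  // Parses well-formed XHTML into a DOM and returns the href of every <a> element,
  // in document order. Tag names are matched case-sensitively.
  static List<String> findLinks(String xhtml) {
    Document doc;
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      doc = builder.parse(new InputSource(new StringReader(xhtml)));
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("Could not parse the page as XML", e);
    }
    NodeList anchors = doc.getElementsByTagName("a");
    List<String> links = new ArrayList<>();
    for (int i = 0; i < anchors.getLength(); i++) {
      Element anchor = (Element) anchors.item(i);
      // Skip named anchors such as <a name="top"> that carry no href.
      if (anchor.hasAttribute("href")) {
        links.add(anchor.getAttribute("href"));
      }
    }
    return links;
  }
}
```

The returned list is what the server would send back to the client, for example as JSON.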

answered Oct 06 '10 at 15:47

Pointy

Can you please give me a code example, or a link to a resource where I can study this a bit? – netha Oct 06 '10 at 16:09
It completely depends on what sort of server-side environment you've got. There are many, many possibilities. – Pointy Oct 06 '10 at 16:50

score 0 · Answer 4 · edited May 23 '17 at 10:28

If you want the content of a page, you'll have to load it. What you can do is load it in memory and parse it to get all the <a> tags and their content.

You'll be able to parse this XML with tools like JDOM or SAX if you're working with Java (as your tag says), or with simple DOM tools in JavaScript. Keep in mind that an XML parser only accepts pages that are well-formed XHTML.
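For the SAX route in Java, a minimal sketch could look like this. It assumes the page text is already in memory and is well-formed XHTML; SaxLinkFinder and findLinks are names invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxLinkFinder {

  // Streams through the XHTML and records the href of each <a> start tag.
  static List<String> findLinks(String xhtml) {
    List<String> links = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes attrs) {
        String href = attrs.getValue("href");
        if ("a".equals(qName) && href != null) {
          links.add(href);
        }
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xhtml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      // Pages that are not well-formed XML end up here.
      throw new IllegalArgumentException("Could not parse the page as XML", e);
    }
    return links;
  }
}
```

SAX never builds a tree, so it suits large pages when all you need is the list of href values.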

Resources:

Parse XML with JavaScript

On the same topic:

get all the href attributes of a web site (javascript)

edited May 23 '17 at 10:28

Community

answered Oct 06 '10 at 15:47

Colin Hebert

@Paddy, you're right, and in this case the best thing to do is looking right to the ` – Colin Hebert Oct 06 '10 at 15:52

score 0 · Answer 5 · answered Oct 06 '10 at 15:48

Just open a URLConnection, get the page, and parse it.
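A rough sketch of the "open a URLConnection and get the page" part is below. PageFetcher and fetch are made-up names, the five-second timeouts are arbitrary, and UTF-8 is assumed instead of reading the charset from the response.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class PageFetcher {

  // Downloads the body behind the URL and returns it as text.
  static String fetch(URL url) throws IOException {
    URLConnection connection = url.openConnection();
    // Timeouts (in milliseconds) keep a dead host from blocking forever.
    connection.setConnectTimeout(5000);
    connection.setReadTimeout(5000);
    StringBuilder body = new StringBuilder();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        body.append(line).append('\n');
      }
    }
    return body.toString();
  }
}
```

The returned text can then be handed to whichever HTML parser you prefer to pull out the <a> tags. All of this runs on the server, so the user never sees the page being loaded.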

answered Oct 06 '10 at 15:48

Spilarix
