
Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all the links from that site.

I've looked at HTTrack, but that downloads the whole site; I simply need the directory tree.

Davidmh
Jonathan Lyon

5 Answers

Answer (score: 90)

Check out linkchecker; it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree.
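
For illustration (this sketch is not from the original answer): once you have flattened linkchecker's report into a plain file, say `urls.txt`, with one absolute URL per line (both the filename and the one-URL-per-line format are assumptions here), a short Java program can fold the URL paths into a printable directory tree:

import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UrlTree {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        // One absolute URL per line, e.g. pulled out of linkchecker's report.
        // Assumes every line parses as a valid URI.
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"));

        // Nested sorted maps act as the tree: path segment -> subtree.
        Map<String, Object> root = new TreeMap<>();
        for (String url : urls) {
            String path = URI.create(url.trim()).getPath();
            if (path == null) continue;
            Map<String, Object> node = root;
            for (String segment : path.split("/")) {
                if (segment.isEmpty()) continue;
                node = (Map<String, Object>)
                        node.computeIfAbsent(segment, k -> new TreeMap<String, Object>());
            }
        }
        print(root, 0);
    }

    @SuppressWarnings("unchecked")
    static void print(Map<String, Object> node, int depth) {
        for (Map.Entry<String, Object> entry : node.entrySet()) {
            System.out.println("  ".repeat(depth) + entry.getKey());
            print((Map<String, Object>) entry.getValue(), depth + 1);
        }
    }
}

Note that this groups purely by path segment and ignores hosts, query strings, and fragments; you would need to extend it if you crawl more than one domain.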

gerzenstl
Hank Gay
  • thank you so much Hank! Perfect - exactly what I needed. Very much appreciated. – Jonathan Lyon Sep 17 '09 at 15:08
  • A nice tool. I was using "Xenu Link Sleuth" before; linkchecker is far more verbose. – Mateng Nov 14 '11 at 20:42
  • how do I do that myself? and what if there is no robots.txt in a web site? – Alan Coromano Jul 30 '13 at 17:15
  • @MariusKavansky How do you manually crawl a website? Or how do you build a crawler? I'm not sure I understand your question. If there is no `robots.txt` file, that just means you can crawl to your heart's content. – Hank Gay Jul 31 '13 at 15:14
  • And this is available in Ubuntu's repository (actually it works with Windows/Mac/Linux) – Adi Fatol Nov 26 '13 at 22:28
  • Such a great little program! – Arash Saidi Nov 29 '14 at 19:24
  • Hi guys, linkchecker has not worked for me; when I scan the site it only returns a report of broken links, which is a very small report. It does say it checked thousands of links, but I can't see where those are reported. Using version 9.3; can you please help? – Jay Nov 05 '15 at 10:33
  • How do I send the output to a file with `--out` or `-o`? – Pandya Oct 09 '18 at 09:25
  • The usual command to write to a file is `linkchecker https://example.com --file-output=csv --verbose`. Different formats can be chosen too. – laimison Jan 12 '21 at 23:53
  • It did not work. It reported an SSL handshake problem and the execution was terminated. – Redoman Mar 19 '21 at 02:21
Answer (score: 64)

If you have the developer console (JavaScript) in your browser, you can type this code in:

urls = document.querySelectorAll('a'); for (const url of urls) console.log(url.href);

Shortened:

n=$$('a');for(u of n)console.log(u.href)
ElectroBit
  • What about "Javascript-ed" urls? – Pacerier Feb 25 '15 at 00:56
  • Like what? What do you mean? – ElectroBit Apr 03 '15 at 20:53
  • I mean a link done using Javascript. Your solution wouldn't show it. – Pacerier Apr 06 '15 at 13:45
  • @ElectroBit I really like it, but I'm not sure what I'm looking at. What is the `$$` operator? Or is that just [an arbitrary function name](http://stackoverflow.com/questions/1463867/javascript-double-dollar-sign), same as `n=ABC('a');`? I'm not understanding how `urls` gets all the 'a' tagged elements. Can you explain? I'm assuming it's not jQuery. What prototype library function are we talking about? – zipzit May 28 '16 at 17:32
  • @zipzit In a handful of browsers, `$$()` is basically shorthand for `document.querySelectorAll()`. More info at this link: https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll – ElectroBit May 28 '16 at 17:54
  • There is no complete computable solution to traversing javascripted urls beyond some very rudimentary attempts. At least this tip is working with the DOM and not the HTML source. – Lothar Dec 05 '17 at 21:03
  • Sadly, in 2023, this little snippet isn't working for me from Chrome's console. – Tony Ennis Mar 22 '23 at 20:59
Answer (score: 6)

Another alternative might be:

Array.from(document.querySelectorAll("a")).map(x => x.href)

With your `$$` it's even shorter:

Array.from($$("a")).map(x => x.href)
Seb
  • Plus 1 - I like that you are using modern JS. I ran this, and while it returned a few links, it didn't return all of the .html pages that are on the top level. Is there a reason why all the pages don't show up in the array? Thanks – Chris22 May 23 '20 at 20:47
Answer (score: 0)

If this is a programming question, then I would suggest you write your own regular expression to parse all the retrieved content. Target tags are IMG and A for standard HTML. For Java,

final String openingTags = "(<a [^>]*href=['\"]?|<img[^> ]* src=['\"]?)";

this, along with the Pattern and Matcher classes, should detect the beginning of the tags. Add the LINK tag if you also want CSS.
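
For illustration only (the capture group and the surrounding code are additions of mine, not part of the answer, and they inherit the same caveats about regex-based HTML parsing):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // The answer's opening-tag pattern, extended with a capture group
    // that grabs everything up to the next quote, space, or '>'.
    private static final Pattern LINK = Pattern.compile(
            "(?:<a [^>]*href=['\"]?|<img[^> ]* src=['\"]?)([^'\" >]+)",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));   // the captured URL candidate
        }
        return links;
    }
}

Usage would be `LinkExtractor.extract(html)` on the fetched page source.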

However, it is not as easy as you may have initially thought. Many web pages are not well-formed. Extracting all the links programmatically that a human being can "recognize" is really difficult if you need to take into account all the irregular markup.

Good luck!

mizubasho
  • No no no no, [don't parse HTML with regex](http://stackoverflow.com/a/1732454/113632), it makes Baby Jesus cry! – dimo414 May 29 '13 at 05:47
Answer (score: -2)
function getalllinks($url) {
    $links = array();
    $content = '';

    // Read the whole page into $content.
    if ($fp = fopen($url, 'r')) {
        while ($line = fread($fp, 1024)) {
            $content .= $line;
        }
        fclose($fp);
    }

    // Walk the text looking for <a ... href="..."> occurrences.
    $startPos = 0;
    while (($spos = strpos($content, '<a ', $startPos)) !== false) {
        $spos = strpos($content, 'href', $spos);
        if ($spos === false) break;
        $spos = strpos($content, '"', $spos);
        if ($spos === false) break;
        $spos += 1;                            // first character of the URL
        $epos = strpos($content, '"', $spos);  // closing quote
        if ($epos === false) break;
        $link = substr($content, $spos, $epos - $spos);
        if (strpos($link, 'http://') !== false) $links[] = $link;
        $startPos = $epos;
    }
    return $links;
}

Try this code.

Morgoth
  • While this answer is probably correct and useful, it is preferred if you include some explanation along with it to explain how it helps to solve the problem. This becomes especially useful in the future, if there is a change (possibly unrelated) that causes it to stop working and users need to understand how it once worked. – Kevin Brown-Silva Mar 06 '15 at 00:12
  • Eh, it's a little **long.** – ElectroBit May 03 '15 at 18:29
  • Completely unnecessary to parse the HTML in this manner in PHP; see http://php.net/manual/en/class.domdocument.php. PHP does have the ability to understand the DOM! – JamesH Jun 26 '15 at 12:30