
Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all the links from that site.

I've looked at HTTrack, but that downloads the whole site; I simply need the directory tree.

Davidmh
Jonathan Lyon

5 Answers

Answer (score: 90)

Check out linkchecker; it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree.
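
For illustration (this sketch is not from the original answer): once you have flattened linkchecker's report into a plain file, say `urls.txt`, with one absolute URL per line (both the filename and the one-URL-per-line format are assumptions here), a short Java program can fold the URL paths into a printable directory tree:

import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UrlTree {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        // One absolute URL per line, e.g. pulled out of linkchecker's report.
        // Assumes every line parses as a valid URI.
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"));

        // Nested sorted maps act as the tree: path segment -> subtree.
        Map<String, Object> root = new TreeMap<>();
        for (String url : urls) {
            String path = URI.create(url.trim()).getPath();
            if (path == null) continue;
            Map<String, Object> node = root;
            for (String segment : path.split("/")) {
                if (segment.isEmpty()) continue;
                node = (Map<String, Object>)
                        node.computeIfAbsent(segment, k -> new TreeMap<String, Object>());
            }
        }
        print(root, 0);
    }

    @SuppressWarnings("unchecked")
    static void print(Map<String, Object> node, int depth) {
        for (Map.Entry<String, Object> entry : node.entrySet()) {
            System.out.println("  ".repeat(depth) + entry.getKey());
            print((Map<String, Object>) entry.getValue(), depth + 1);
        }
    }
}

Note that this groups purely by path segment and ignores hosts, query strings, and fragments; you would need to extend it if you crawl more than one domain.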

gerzenstl
Hank Gay
  • thank you so much Hank! Perfect - exactly what I needed. Very much appreciated. – Jonathan Lyon Sep 17 '09 at 15:08
  • A nice tool. I was using "Xenu Link Sleuth" before; linkchecker is far more verbose. – Mateng Nov 14 '11 at 20:42
  • how do I do that myself? and what if there is no robots.txt in a web site? – Alan Coromano Jul 30 '13 at 17:15
  • @MariusKavansky How do you manually crawl a website? Or how do you build a crawler? I'm not sure I understand your question. If there is no `robots.txt` file, that just means you can crawl to your heart's content. – Hank Gay Jul 31 '13 at 15:14
  • And this is available in Ubuntu's repository (actually it works with Windows/Mac/Linux) – Adi Fatol Nov 26 '13 at 22:28
  • Such a great little program! – Arash Saidi Nov 29 '14 at 19:24
  • Hi guys, linkchecker has not worked for me; when I scan the site it only returns a report of broken links, which is a very small report. It does say it checked thousands of links, but I can't see where those are reported. Using version 9.3; can you please help? – Jay Nov 05 '15 at 10:33
  • How do I send the output to a file with `--out` or `-o`? – Pandya Oct 09 '18 at 09:25
  • The usual command to write to a file is `linkchecker https://example.com --file-output=csv --verbose`. Different formats can be chosen too. – laimison Jan 12 '21 at 23:53
  • It did not work. It reported an SSL handshake problem and the execution was terminated. – Redoman Mar 19 '21 at 02:21
Answer (score: 64)

If you have the developer console (JavaScript) in your browser, you can type this code in:

urls = document.querySelectorAll('a'); for (const url of urls) console.log(url.href);

Shortened:

n=$$('a');for(u of n)console.log(u.href)
ElectroBit
  • What about "Javascript-ed" urls? – Pacerier Feb 25 '15 at 00:56
  • Like what? What do you mean? – ElectroBit Apr 03 '15 at 20:53
  • I mean a link done using Javascript. Your solution wouldn't show it. – Pacerier Apr 06 '15 at 13:45
  • @ElectroBit I really like it, but I'm not sure what I'm looking at. What is the `$$` operator? Or is that just [an arbitrary function name](http://stackoverflow.com/questions/1463867/javascript-double-dollar-sign), same as `n=ABC('a');`? I'm not understanding how `urls` gets all the 'a' tagged elements. Can you explain? I'm assuming it's not jQuery. What prototype library function are we talking about? – zipzit May 28 '16 at 17:32
  • @zipzit In a handful of browsers, `$$()` is basically shorthand for `document.querySelectorAll()`. More info at this link: https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll – ElectroBit May 28 '16 at 17:54
  • There is no complete computable solution to traversing javascripted urls beyond some very rudimentary attempts. At least this tip is working with the DOM and not the HTML source. – Lothar Dec 05 '17 at 21:03
  • Sadly, in 2023, this little snippet isn't working for me from Chrome's console. – Tony Ennis Mar 22 '23 at 20:59
Answer (score: 6)

Another alternative might be:

Array.from(document.querySelectorAll("a")).map(x => x.href)

With your `$$` it's even shorter:

Array.from($$("a")).map(x => x.href)
Seb
  • Plus 1 - I like that you are using modern JS. I ran this, and while it returned a few links, it didn't return all of the .html pages that are on the top level. Is there a reason why all the pages don't show up in the array? Thanks – Chris22 May 23 '20 at 20:47
Answer (score: 0)

If this is a programming question, then I would suggest you write your own regular expression to parse all the retrieved content. Target tags are IMG and A for standard HTML. For Java,

final String openingTags = "(<a [^>]*href=['\"]?|<img[^> ]* src=['\"]?)";

this, along with the Pattern and Matcher classes, should detect the beginning of the tags. Add the LINK tag if you also want CSS.
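
For illustration only (the capture group and the surrounding code are additions of mine, not part of the answer, and they inherit the same caveats about regex-based HTML parsing):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // The answer's opening-tag pattern, extended with a capture group
    // that grabs everything up to the next quote, space, or '>'.
    private static final Pattern LINK = Pattern.compile(
            "(?:<a [^>]*href=['\"]?|<img[^> ]* src=['\"]?)([^'\" >]+)",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));   // the captured URL candidate
        }
        return links;
    }
}

Usage would be `LinkExtractor.extract(html)` on the fetched page source.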

However, it is not as easy as you may have initially thought. Many web pages are not well-formed. Extracting all the links programmatically that a human being can "recognize" is really difficult if you need to take into account all the irregular markup.

Good luck!

mizubasho
  • No no no no, [don't parse HTML with regex](http://stackoverflow.com/a/1732454/113632), it makes Baby Jesus cry! – dimo414 May 29 '13 at 05:47
Answer (score: -2)
function getalllinks($url) {
    $links = array();
    $content = '';

    // Read the whole page into $content.
    if ($fp = fopen($url, 'r')) {
        while ($line = fread($fp, 1024)) {
            $content .= $line;
        }
        fclose($fp);
    }

    // Walk the text looking for <a ... href="..."> occurrences.
    $startPos = 0;
    while (($spos = strpos($content, '<a ', $startPos)) !== false) {
        $spos = strpos($content, 'href', $spos);
        if ($spos === false) break;
        $spos = strpos($content, '"', $spos);
        if ($spos === false) break;
        $spos += 1;                            // first character of the URL
        $epos = strpos($content, '"', $spos);  // closing quote
        if ($epos === false) break;
        $link = substr($content, $spos, $epos - $spos);
        if (strpos($link, 'http://') !== false) $links[] = $link;
        $startPos = $epos;
    }
    return $links;
}

Try this code.

Morgoth
  • While this answer is probably correct and useful, it is preferred if you include some explanation along with it to explain how it helps to solve the problem. This becomes especially useful in the future, if there is a change (possibly unrelated) that causes it to stop working and users need to understand how it once worked. – Kevin Brown-Silva Mar 06 '15 at 00:12
  • Eh, it's a little **long.** – ElectroBit May 03 '15 at 18:29
  • Completely unnecessary to parse the HTML in this manner in PHP; see http://php.net/manual/en/class.domdocument.php. PHP does have the ability to understand the DOM! – JamesH Jun 26 '15 at 12:30