
I'm trying to determine all the valid URLs under a given domain without having to mirror the site locally.

People generally want to download all the pages, but I just want a list of the direct URLs under a given domain (e.g. www.example.com), which would be something like

  • www.example.com/page1
  • www.example.com/page2
  • etc.

Is there a way to use wget to do this, or is there a better approach?

fccoelho

2 Answers


OK, I had to find my own answer.

The tool I used was httrack.

httrack -p0 -r2 -d www.example.com
  • the -p0 option tells it to just scan (not save pages);
  • the -rN option sets the search depth (here 2);
  • the -d option tells it to stay on the same principal domain.

There is even a -%L option to write the scanned URLs to a specified file, but it doesn't seem to work. That's not a problem, though, because under the hts-cache directory you can find a TSV file named new.txt containing all the URLs visited and some additional information about them. I could extract the URLs from it with the following Python code:

with open("hts-cache/new.txt") as f:
    t = csv.DictReader(f,delimiter='\t')
    for l in t:
        print l['URL']
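
If you want the list in a file instead of on stdout, the same idea extends naturally. This is just a sketch: the output filename urls.txt is my own choice, and it assumes new.txt has a header row with a URL column, as the snippet above already relies on.

import csv

# Collect the URL column from HTTrack's crawl log and write one unique URL per line.
urls = set()
with open("hts-cache/new.txt") as f:
    for row in csv.DictReader(f, delimiter='\t'):
        urls.add(row['URL'])

with open("urls.txt", "w") as out:
    for url in sorted(urls):
        out.write(url + "\n")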
fccoelho

It's unclear whether you want to use wget to determine these URLs, but to answer your question about not saving the site, you could use "--output-document=file" or simply "-O file".

wget -O /dev/null <your-site>

If you have a list of URLs and want to check that they work, you can check for an exit code greater than 0, i.e.

while read URL
do
  wget -O /dev/null "$URL" >/dev/null 2>&1
  [ $? -gt 0 ] && echo "ERROR retrieving $URL"
done < your-URL-list.txt
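
If you would rather do that check from Python, here is a rough equivalent of the loop above. It is only a sketch: the filename your-URL-list.txt is taken from the shell example, and it uses urllib from the standard library to fetch each URL and report failures.

import urllib.request
import urllib.error

# Report every URL in the list that cannot be retrieved, mirroring the shell loop above.
with open("your-URL-list.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        try:
            with urllib.request.urlopen(url):
                pass
        except (urllib.error.URLError, ValueError) as exc:
            print("ERROR retrieving %s: %s" % (url, exc))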
sastorsl
  • No, if I wanted only to check whether the pages were there, I could use wget --spider, but I want it to find all of the URLs and list them on stdout or in a file. – fccoelho Sep 24 '13 at 19:08