
I'm trying to determine all the valid URLs under a given domain without having to mirror the site locally.

People generally want to download all the pages, but I just want a list of the direct URLs under a given domain (e.g. www.example.com), which would be something like

  • www.example.com/page1
  • www.example.com/page2
  • etc.

Is there a way to use wget to do this, or is there a better approach?

fccoelho

2 Answers


OK, I had to find my own answer.

The tool I used was httrack.

httrack -p0 -r2 -d www.example.com
  • the -p0 option tells it to just scan (not save pages);
  • the -rN option sets the search depth (here 2);
  • the -d option tells it to stay on the same principal domain.

There is even a -%L option to write the scanned URLs to a specified file, but it doesn't seem to work. That's not a problem, though, because under the hts-cache directory you can find a TSV file named new.txt containing all the URLs visited and some additional information about them. I could extract the URLs from it with the following Python code:

with open("hts-cache/new.txt") as f:
    t = csv.DictReader(f,delimiter='\t')
    for l in t:
        print l['URL']
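
If you want the list in a file instead of on stdout, the same idea extends naturally. This is just a sketch: the output filename urls.txt is my own choice, and it assumes new.txt has a header row with a URL column, as the snippet above already relies on.

import csv

# Collect the URL column from HTTrack's crawl log and write one unique URL per line.
urls = set()
with open("hts-cache/new.txt") as f:
    for row in csv.DictReader(f, delimiter='\t'):
        urls.add(row['URL'])

with open("urls.txt", "w") as out:
    for url in sorted(urls):
        out.write(url + "\n")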
fccoelho

It's unclear whether you want to use wget to determine these URLs, but to answer your question about not saving the site, you could use "--output-document=file" or simply "-O file".

wget -O /dev/null <your-site>

If you have a list of URLs and want to check that they work, you can check for an exit code greater than 0, i.e.

while read URL
do
  wget -O /dev/null "$URL" >/dev/null 2>&1
  [ $? -gt 0 ] && echo "ERROR retrieving $URL"
done < your-URL-list.txt
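
If you would rather do that check from Python, here is a rough equivalent of the loop above. It is only a sketch: the filename your-URL-list.txt is taken from the shell example, and it uses urllib from the standard library to fetch each URL and report failures.

import urllib.request
import urllib.error

# Report every URL in the list that cannot be retrieved, mirroring the shell loop above.
with open("your-URL-list.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        try:
            with urllib.request.urlopen(url):
                pass
        except (urllib.error.URLError, ValueError) as exc:
            print("ERROR retrieving %s: %s" % (url, exc))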
sastorsl
  • No, if I wanted only to check whether the pages were there, I could use wget --spider, but I want it to find all of the URLs and list them on stdout or in a file. – fccoelho Sep 24 '13 at 19:08