
I'm trying to crawl a local site with wget -r, but it just downloads the first page and doesn't go any deeper. In fact, recursion fails for whatever site I try... :)

I've tried various options, but nothing improves. Here's the command I thought would do it:

wget -r -e robots=off --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.79 Safari/537.4" --follow-tags=a,ref --debug http://rocky:8081/obix

Really, I have no clue. Every site and piece of documentation I've read about wget says it should simply work with wget -r, so I'm starting to think my wget is buggy (I'm on Fedora 16).

Any idea?


EDIT: Here's the output I'm getting for wget -r --follow-tags=ref,a http://rocky:8081/obix/ :

wget -r --follow-tags=ref,a http://rocky:8081/obix/
--2012-10-19 09:29:51--  http://rocky:8081/obix/
Resolving rocky... 127.0.0.1
Connecting to rocky|127.0.0.1|:8081... connected.
HTTP request sent, awaiting response... 200 OK
Length: 792 [text/xml]
Saving to: “rocky:8081/obix/index.html”

100%[==============================================================================>] 792 --.-K/s in 0s

2012-10-19 09:29:51 (86,0 MB/s) - “rocky:8081/obix/index.html” saved [792/792]

FINISHED --2012-10-19 09:29:51-- Downloaded: 1 files, 792 in 0s (86,0 MB/s)

Pierre Voisin
1 Answer


Usually there's no need to specify the user agent.

It should be sufficient to give:

wget -r http://stackoverflow.com/questions/12955253/recursive-wget-wont-work

To see why wget doesn't do what you want, look at the output it gives you and post it here.

Olaf Dietsche
  • Indeed, it works on this page, but when I then try the following on my local host it stops at the very first document: `wget -r --follow-tags=ref,a http://rocky:8081/obix/` Note: the links I want to follow are not regular HTML anchors but "ref" tags, so I've changed the `--follow-tags` option accordingly. I don't think that's the problem, though, since I'm not able to crawl many other regular sites either. – Pierre Voisin Oct 19 '12 at 13:26
  • Wget should tell you why it can't download anything. As I said, look at the output. – Olaf Dietsche Oct 19 '12 at 13:28
  • Apologies, I'm quite new to posting on Stack Overflow and I've been battling a bit with the markup... ;) Please see the output as an edit to my post above. – Pierre Voisin Oct 19 '12 at 13:48
  • Since wget downloads "index.html" successfully, I guess there are no `ref` and `a` links. In the output wget says "[text/xml]", so maybe it's not HTML. Have you looked into index.html? – Olaf Dietsche Oct 19 '12 at 13:58
  • Yes, there are! I just had a look at one of the most recent versions of wget and recursion seems to be based on lists of known tags/attributes... which don't contain "ref", so I believe I'm screwed. :) – Pierre Voisin Oct 19 '12 at 14:21
  • `man wget` states: > Wget can follow links in HTML, XHTML, and CSS pages, to create local ... Since this is an XML file, you're out of luck. – Olaf Dietsche Oct 19 '12 at 14:49
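
A possible workaround, given that wget only follows links in HTML, XHTML, and CSS: walk the `ref` hrefs yourself with a small script and fetch each document. The sketch below is untested against the server from the question and assumes the oBIX feed at http://rocky:8081/obix/ returns XML whose <ref> elements carry href attributes relative to the requested URL; the function name crawl is made up for illustration.

# Minimal sketch: follow <ref> hrefs in oBIX-style XML ourselves,
# since wget's recursion only handles HTML/XHTML/CSS links.
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

def crawl(url, seen=None):
    """Fetch url, then recursively fetch every <ref href="..."> it contains."""
    if seen is None:
        seen = set()
    if url in seen:
        return seen
    seen.add(url)
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    print("fetched", url, len(data), "bytes")
    root = ET.fromstring(data)
    for elem in root.iter():
        # Ignore XML namespaces and compare only the local tag name.
        tag = elem.tag.rsplit('}', 1)[-1]
        if tag == 'ref' and 'href' in elem.attrib:
            crawl(urljoin(url, elem.attrib['href']), seen)
    return seen

if __name__ == '__main__':
    crawl('http://rocky:8081/obix/')

This only fetches and follows the references; to mirror the site like wget -r you would additionally write each response to a file named after its URL path.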