
I'd like to crawl a web site to build its sitemap.

Problem is, the site uses an htaccess file to block spiders, so the following command only downloads the homepage (index.html) and stops, even though that page contains links to other pages:

wget -mkEpnp -e robots=off -U Mozilla http://www.acme.com

Since I have no problem accessing the rest of the site with a browser, I assume the "-e robots=off -U Mozilla" options aren't enough to have wget pretend it's a browser.

Are there other options I should know about? Does wget handle cookies by itself?
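From what I've read, wget does seem to handle cookies within a single run; to carry them across separate runs they apparently have to be saved and reloaded explicitly, roughly like this (cookies.txt is just a placeholder file name):

wget --save-cookies cookies.txt --keep-session-cookies http://www.acme.com
wget --load-cookies cookies.txt -mkEpnp -e robots=off -U Mozilla http://www.acme.com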

Thank you.

--

Edit: I added these options to wget.ini, to no avail (an equivalent command-line form is sketched below):

hsts=0
robots = off
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
referer = /
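For reference, the same settings can also be passed directly on the command line instead of through the configuration file. A sketch using the headers above (writing the Referer out as the full homepage URL is an assumption about what "referer = /" was meant to do):

wget -mkEpnp -e robots=off \
  --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0" \
  --referer="http://www.acme.com/" \
  --header="Accept-Language: en-us,en;q=0.5" \
  --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
  http://www.acme.com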

--

Edit: Found it.

The pages linked to in the homepage were on a remote server, so wget would ignore them. Just add "--span-hosts" to tell wget to go there, and "-D www.remote.site.com" if you want to restrict spidering to that domain.
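Combined with the original command, that looks roughly like this (listing both the starting domain and the remote one so links on either host keep being followed; www.remote.site.com is the placeholder name from above):

wget -mkEpnp -e robots=off -U Mozilla --span-hosts -D www.acme.com,www.remote.site.com http://www.acme.com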

Gulbahar

1 Answer


You might want to set the User-Agent to something more specific than just "Mozilla", for example:

wget --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
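Combined with the flags from the question, the full call would look roughly like:

wget -mkEpnp -e robots=off --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0" http://www.acme.com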
Giuseppe Scrivano
    Checking the request headers didn't help: "Upgrade-Insecure-Requests" isn't supported, and the User-Agent makes no difference http://stackoverflow.com/questions/4423061/view-http-headers-in-google-chrome – Gulbahar Apr 03 '17 at 14:48