
I am trying to use httrack (http://www.httrack.com/) to download a single page, not the entire site. For example, when using httrack to download www.google.com it should only download the HTML found under www.google.com along with all stylesheets, images and JavaScript, and not follow any links to images.google.com, labs.google.com, www.google.com/subdir/, etc.

I tried the -w option but that did not make any difference.

What would be the right command?

EDIT

I tried using httrack "http://www.google.com/" -O "./www.google.com" "http://www.google.com/" -v -s0 --depth=1, but then it won't copy any images.

What I basically want is to download just the index file of that domain along with all its assets, but not the content of any external or internal links.

– Max

5 Answers

httrack "http://www.google.com/" -O "./www.google.com" "http://www.google.com/" -v -s0  --depth=1 -n

The -n option (or --near) will download the images used on a page no matter where they are located.

Say an image is located at google.com/foo/bar/logo.png. Because the mirror is restricted to the starting page (--depth=1), it would not be downloaded unless you specify --near.

– Sourav Ghosh

  • Click on "Set Options"
  • Go to the tab "Limits"
  • Set "Maximum external depth" to 0

(screenshot: copy one page only with httrack)
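If you prefer the command line, the same limit can be set with HTTrack's --ext-depth option. A sketch, using the www.google.com example from the question (--depth=1 additionally keeps the mirror to the single starting page):

httrack "http://www.google.com/" -O "./www.google.com" -v --depth=1 --ext-depth=0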

– Lucas Bustamante

Could you use wget instead of httrack? wget -p will download a single page and all of its “prerequisites” (images, stylesheets).
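A sketch of how that might look for the page in the question, adapted from the usual page-requisites recipe (all of these are standard wget options: -E adjusts file extensions, -H allows requisites hosted on other domains, -k converts links for local viewing, -K keeps backups of the converted files):

wget -E -H -k -K -p "http://www.google.com/"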

– Kevin Reid
  • wget would be my fallback solution if httrack can't do the job. – Max Dec 28 '09 at 14:57
  • The question is about `httrack`, so stay on track; wget doesn't execute JS. – Toolkit Feb 08 '17 at 15:58
  • `wget` fails if resources have query strings: it downloads files named with the query string itself. – keul Apr 20 '17 at 08:56
  • `wget` does not work properly for some sites/pages. I needed to use `httrack` as per @torger's answer below to get all the required CSS files and have the links corrected. – gone Jan 28 '19 at 11:12

Looking at the example:

httrack "http://www.all.net/" -O "/tmp/www.all.net" "+*.all.net/*" -v

The last part is a scan rule (an HTTrack wildcard filter, not a true regex). Just write a filter that matches everything you want to keep.

httrack "http://www.google.com.au/" -O "/tmp/www.google.com.au" "+*.google.com.au/*" -v --depth=2 --ext-depth=2

I had to use the localised domain, otherwise I got a redirect page. Localise to whichever Google domain you are redirected to.
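Applied to the original question's domain, the same pattern might look like this (a sketch that keeps the depth settings above; narrow the filter to the www host so that subdomains such as images.google.com stay excluded, and lower --depth to 1 if you only want the index page itself):

httrack "http://www.google.com/" -O "/tmp/www.google.com" "+www.google.com/*" -v --depth=2 --ext-depth=2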

– torger

The purpose of HTTrack is to follow links. Try setting --ext-depth=0.
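Combined with the command from the question's edit, that might look like this (a sketch; the -n/--near flag from the first answer is added so that images referenced by the page are still fetched):

httrack "http://www.google.com/" -O "./www.google.com" -v --depth=1 --ext-depth=0 -n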

– Gregory Pakosz