
I am trying to use httrack (http://www.httrack.com/) to download a single page, not the entire site. For example, when using httrack to download www.google.com it should only download the HTML found under www.google.com along with all stylesheets, images and JavaScript, and not follow any links to images.google.com, labs.google.com, www.google.com/subdir/, etc.

I tried the -w option but that did not make any difference.

What would be the right command?

EDIT

I tried using httrack "http://www.google.com/" -O "./www.google.com" "http://www.google.com/" -v -s0 --depth=1, but then it won't copy any images.

What I basically want is to download just the index file of that domain along with all its assets, but not the content of any external or internal links.

– Max

5 Answers

httrack "http://www.google.com/" -O "./www.google.com" "http://www.google.com/" -v -s0  --depth=1 -n

The -n option (or --near) will download the images used on a page no matter where they are located.

Say an image is located at google.com/foo/bar/logo.png. Because the mirror is restricted to the starting page (--depth=1), it would not be downloaded unless you specify --near.

– Sourav Ghosh

  • Click on "Set Options"
  • Go to the tab "Limits"
  • Set "Maximum external depth" to 0

(screenshot: copy one page only with httrack)
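If you prefer the command line, the same limit can be set with HTTrack's --ext-depth option. A sketch, using the www.google.com example from the question (--depth=1 additionally keeps the mirror to the single starting page):

httrack "http://www.google.com/" -O "./www.google.com" -v --depth=1 --ext-depth=0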

– Lucas Bustamante

Could you use wget instead of httrack? wget -p will download a single page and all of its “prerequisites” (images, stylesheets).
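A sketch of how that might look for the page in the question, adapted from the usual page-requisites recipe (all of these are standard wget options: -E adjusts file extensions, -H allows requisites hosted on other domains, -k converts links for local viewing, -K keeps backups of the converted files):

wget -E -H -k -K -p "http://www.google.com/"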

– Kevin Reid
  • wget would be my fallback solution if httrack can't do the job. – Max Dec 28 '09 at 14:57
  • The question is about `httrack`, so stay on track; wget doesn't execute JS. – Toolkit Feb 08 '17 at 15:58
  • `wget` fails if resources have query strings: it downloads files named with the query string itself. – keul Apr 20 '17 at 08:56
  • `wget` does not work properly for some sites/pages. I needed to use `httrack` as per @torger's answer below to get all the required CSS files and have the links corrected. – gone Jan 28 '19 at 11:12

Looking at the example:

httrack "http://www.all.net/" -O "/tmp/www.all.net" "+*.all.net/*" -v

The last part is a scan rule (an HTTrack wildcard filter, not a true regex). Just write a filter that matches everything you want to keep.

httrack "http://www.google.com.au/" -O "/tmp/www.google.com.au" "+*.google.com.au/*" -v --depth=2 --ext-depth=2

I had to use the localised domain, otherwise I got a redirect page. Localise to whichever Google domain you are redirected to.
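Applied to the original question's domain, the same pattern might look like this (a sketch that keeps the depth settings above; narrow the filter to the www host so that subdomains such as images.google.com stay excluded, and lower --depth to 1 if you only want the index page itself):

httrack "http://www.google.com/" -O "/tmp/www.google.com" "+www.google.com/*" -v --depth=2 --ext-depth=2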

– torger

The purpose of HTTrack is to follow links. Try setting --ext-depth=0.
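Combined with the command from the question's edit, that might look like this (a sketch; the -n/--near flag from the first answer is added so that images referenced by the page are still fetched):

httrack "http://www.google.com/" -O "./www.google.com" -v --depth=1 --ext-depth=0 -n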

– Gregory Pakosz