315

There is an online HTTP directory that I have access to. I have tried to download all the sub-directories and files via wget. The problem is that when wget downloads a sub-directory, it downloads only that sub-directory's index.html file, which lists the files in the directory, without downloading the files themselves.

Is there a way to download all the sub-directories and files without a depth limit (as if the directory I want to download were just a folder that I want to copy to my computer)?


leiyc
Omar

8 Answers

549

Solution:

wget -r -np -nH --cut-dirs=3 -R index.html http://hostname/aaa/bbb/ccc/ddd/

Explanation:

  • It will download all files and subfolders under the ddd directory:
  • -r : recurse into subdirectories
  • -np : do not ascend to parent directories, like ccc/…
  • -nH : do not save files under a hostname/ folder
  • --cut-dirs=3 : save into ddd by omitting the first 3 path components aaa, bbb, ccc
  • -R index.html : exclude index.html files

Reference: http://bmwieczorek.wordpress.com/2008/10/01/wget-recursively-download-all-files-from-certain-directory-listed-by-apache/
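
Several of the comments below mention useful additions: lifting the default recursion depth of 5, adding a delay between requests, ignoring a restrictive robots.txt, and rejecting the index.html?C=... sort pages that Apache listings generate. A variant that combines them (treat it as a sketch to adapt, not a drop-in command) could look like:

wget -r -l inf -np -nH --cut-dirs=3 -w 1 -e robots=off -R "index.html*" http://hostname/aaa/bbb/ccc/ddd/

Here -l inf removes the depth limit, -w 1 waits one second between requests, -e robots=off tells wget to ignore robots.txt, and -R "index.html*" also rejects the index.html?C=... variants.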

gibbone
Mingjiang Shi
  • Thank you! Also, FYI according to [this](http://unix.stackexchange.com/questions/53397/wget-how-to-download-recursively-and-only-specific-mime-types-extensions-i-e) you can use `-R` like `-R css` to exclude all CSS files, or use `-A` like `-A pdf` to only download PDF files. – John Apr 13 '15 at 20:52
  • Thanks! Additional advice taken from the [wget man page](https://www.gnu.org/software/wget): `When downloading from Internet servers, consider using the ‘-w’ option to introduce a delay between accesses to the server. The download will take a while longer, but the server administrator will not be alarmed by your rudeness.` – jgrump2012 Jul 08 '16 at 16:26
  • I get this error: 'wget' is not recognized as an internal or external command, operable program or batch file. – hamish Mar 05 '17 at 01:42
  • @hamish you may need to install wget first, or wget is not in your $PATH. – Mingjiang Shi Mar 07 '17 at 03:30
  • Great answer, but note that if there is a `robots.txt` file disallowing the downloading of files in the directory, this won't work. In that case you need to add `-e robots=off`. See https://unix.stackexchange.com/a/252564/10312 – Daniel Hershcovich Apr 16 '18 at 11:02
  • I've installed wget but can't get this to work: not at all with cmd.exe, but somewhat in Windows PowerShell. If I just enter "wget http://someurl" it gives me a bunch of info, but if I try to add any of the parameters I get an error that a parameter cannot be found that matches parameter name 'r'. – MilkyTech Jan 08 '19 at 23:12
  • On Mac: `Warning: Invalid character is found in given range. A specified range MUST Warning: have only digits in 'start'-'stop'. The server's response to this Warning: request is uncertain. curl: no URL specified! curl: try 'curl --help' or 'curl --manual' for more information`. No result. – user305883 Jan 30 '19 at 19:46
  • @user305883 the warning message you posted is from curl? – Mingjiang Shi Jan 31 '19 at 01:24
  • @MingjiangShi from wget (the command line from your answer). I also tried `curl -O 'http://example.com/directory/'` but it does not go through: `curl: Remote file name has no length!` There is an html page with `name.pdf name2.pdf image1.png name3.pdf...` and I wish to download all the listed documents (in the href). – user305883 Jan 31 '19 at 08:57
  • What about https? I get the warning: OpenSSL: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure. Unable to establish SSL connection. – Yannis Dran May 04 '19 at 03:06
  • 1
    To get rid of all the different types of index files (index.html?... etc) you need to ensure you add: -R index.html* – Jolly1234 Feb 14 '20 at 21:23
  • What about downloading a **specific file type** using **VisualWget**? Is it possible to download only **mp3** files in a directory and its sub-directories in **VisualWget**? –  May 30 '20 at 07:14
  • Can anybody help me out? I am only getting one file, index.html.tmp, and a blank folder. What is the issue? – Mujtaba Oct 28 '20 at 16:56
  • I recommend the option below: --reject-regex "(.*)\?(.*)" – Namo Nov 14 '20 at 13:33
  • PHP files are all blank – ßiansor Å. Ålmerol Jun 12 '21 at 08:25
  • 1
    This command works for me. Just one more thing, if there are other UTF-8 characters, we can add one more parameter "--restrict-file-names=nocontrol". – MadHatter Jul 20 '21 at 06:21
  • Unfortunately, this doesn't work for the above case. It follows the parent directory regardless of the --no-parent flag. – 0script0 Oct 23 '21 at 21:23
  • Note that since the default recursion depth limit is 5, you have to increase it with the '-l' option to set the depth limit as desired. Use 'inf' or '0' for infinite depth. – Akhil Raj Nov 07 '21 at 20:58
  • I am using this: `wget -r -np -nH --no-check-certificate -e robots=off --cut-dirs=4 -R index.html http://example.com` – ishandutta2007 Aug 25 '23 at 02:10
68

I was able to get this to work thanks to this post, using VisualWget. It worked great for me. The important part seems to be to check the -recursive flag (see image).

I also found that the -no-parent flag is important, otherwise it will try to download everything.

[screenshots: VisualWget settings with the -recursive and -no-parent options checked]

mateuscb
  • Just found this - Dec 2017. It works fine. I got it at https://sourceforge.net/projects/visualwget/ – SDsolar Dec 09 '17 at 07:02
  • 5
    Worked fine on Windows machine, don't forget to check in the options mentioned in the answer , else it won't work – coder3521 Dec 28 '17 at 08:50
  • 4
    Doesn't work with certain https. @DaveLucre if you tried with wget in cmd solution you would be able to download as well, but some severs do not allow it I guess – Yannis Dran May 04 '19 at 03:02
  • what does checked `--no-parent` do? – T.Todua Aug 08 '19 at 11:33
  • it's the same setting as `wget` (as one of the other answers here): **‘-np’ ‘--no-parent’** Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded. See Directory-Based Limits, for more details. – mateuscb Aug 08 '19 at 15:50
  • 4
    Working in March 2020! – Mr Programmer Mar 11 '20 at 18:00
  • What about downloading a **specific file type** using **VisualWget**? Is it possible to download only **mp3** files in a directory and its sub-directories in **VisualWget**? –  May 30 '20 at 07:14
  • 4
    Latest version of vwget (2.4.105.0) uses wget version 1.11, this does not work with with HTTPS sites. See this post for more info, could not get this to work at all unfortunately. https://stackoverflow.com/questions/28757232/unable-to-establish-ssl-connection-upon-wget-on-ubuntu-14-04-lts – Dave Jun 21 '22 at 19:13
27

You can use lftp, the Swiss army knife of downloading. If you have bigger files, you can add --use-pget-n=10 to the command (see the second example below).

lftp -c 'mirror --parallel=100 https://example.com/files/ ;exit'
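
For example, combining both of the above (the URL is the same placeholder as in the first command):

lftp -c 'mirror --parallel=100 --use-pget-n=10 https://example.com/files/ ;exit'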
nwgat
  • Worked perfectly and really fast; this maxed out my internet line downloading thousands of small files. Very good. – n13 Jun 27 '20 at 19:47
  • 4
    Explain what these parametres do please – leetbacoon Nov 26 '20 at 08:37
  • 3
    -c = continue, mirror = mirrors content locally, parallel=100 = downloads 100 files, ;exit = exits the program, use-pget = splits bigger files into segments and downloads parallels – nwgat Dec 17 '20 at 06:55
  • 3
    I had issues with this command. Some videos I was trying to download were broken. If I download them normally and individually from the browser it works perfectly. – Hassen Ch. Dec 30 '20 at 13:12
  • 3
    The most voted solution has no problem with any file. All good! – Hassen Ch. Dec 30 '20 at 13:34
  • Thanks @nwgat it worked like a charm, and matched my requirements. – Jahan Zinedine May 08 '21 at 18:41
  • This worked really well for me, exactly what I needed for my problem. Plus it is blindingly fast, especially with the --use-pget switch set. Thanks @nwgat – corl Sep 27 '21 at 04:00
  • Does this work from the command line in Windows 10? – Mark Miller May 29 '22 at 18:32
  • 1
    Is there an `rsync` like option, to not download files that have already been downloaded and haven't changed? – CivFan Jan 11 '23 at 21:26
18
wget -r -np -nH --cut-dirs=3 -R index.html http://hostname/aaa/bbb/ccc/ddd/

From man wget

‘-r’ ‘--recursive’ Turn on recursive retrieving. See Recursive Download, for more details. The default maximum depth is 5.

‘-np’ ‘--no-parent’ Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded. See Directory-Based Limits, for more details.

‘-nH’ ‘--no-host-directories’ Disable generation of host-prefixed directories. By default, invoking Wget with ‘-r http://fly.srk.fer.hr/’ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.

‘--cut-dirs=number’ Ignore number directory components. This is useful for getting a fine-grained control over the directory where recursive retrieval will be saved.

Take, for example, the directory at ‘ftp://ftp.xemacs.org/pub/xemacs/’. If you retrieve it with ‘-r’, it will be saved locally under ftp.xemacs.org/pub/xemacs/. While the ‘-nH’ option can remove the ftp.xemacs.org/ part, you are still stuck with pub/xemacs. This is where ‘--cut-dirs’ comes in handy; it makes Wget not “see” number remote directory components. Here are several examples of how ‘--cut-dirs’ option works.

No options       -> ftp.xemacs.org/pub/xemacs/
-nH              -> pub/xemacs/
-nH --cut-dirs=1 -> xemacs/
-nH --cut-dirs=2 -> .

--cut-dirs=1     -> ftp.xemacs.org/xemacs/
...

If you just want to get rid of the directory structure, this option is similar to a combination of ‘-nd’ and ‘-P’. However, unlike ‘-nd’, ‘--cut-dirs’ does not lose with subdirectories—for instance, with ‘-nH --cut-dirs=1’, a beta/ subdirectory will be placed to xemacs/beta, as one would expect.
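
Applied to the command at the top of this answer, the same rules give the following (illustrative, for a hypothetical file ddd/file.txt under http://hostname/aaa/bbb/ccc/ddd/):

No options        -> hostname/aaa/bbb/ccc/ddd/file.txt
-nH               -> aaa/bbb/ccc/ddd/file.txt
-nH --cut-dirs=3  -> ddd/file.txt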

Ryan R
Natalie Ng
  • Some explanations would be great. – Benoît Latinier Jun 19 '17 at 20:47
  • 3
    What about downloading a **specific file type** using **VisualWget**? Is it possible to download only **mp3** files in a directory and its sub-directories in **VisualWget**? –  May 30 '20 at 07:15
9

No Software or Plugin required!

(only usable if you don't need recursive depth)

Use a bookmarklet. Drag this link into your bookmarks, then edit it and paste in this code:

javascript:(function(){ var l=document.links; var ext=prompt("Extension to download (all links containing it will be downloaded):", ".mp3"); for(var i=0; i<l.length; i++){ if(l[i].href.indexOf(ext) !== -1){ l[i].setAttribute("download", l[i].text); l[i].click(); } } })();

Then go to the page you want to download files from and click the bookmarklet.

T.Todua
  • 53,146
  • 19
  • 236
  • 237
4

wget is an invaluable resource and something I use myself. However, sometimes there are characters in the address that wget identifies as syntax errors. I'm sure there is a fix for that, but as this question did not ask specifically about wget, I thought I would offer an alternative for those people who will undoubtedly stumble upon this page looking for a quick fix with no learning curve required.

There are a few browser extensions that can do this, but most require installing download managers, which aren't always free, tend to be an eyesore, and use a lot of resources. Here's one that has none of these drawbacks:

"Download Master" is an extension for Google Chrome that works great for downloading from directories. You can choose to filter which file-types to download, or download the entire directory.

https://chrome.google.com/webstore/detail/download-master/dljdacfojgikogldjffnkdcielnklkce

For an up-to-date feature list and other information, visit the project page on the developer's blog:

http://monadownloadmaster.blogspot.com/

Peter
Moscarda
2

You can use this Firefox add-on to download all files in an HTTP directory.

https://addons.mozilla.org/en-US/firefox/addon/http-directory-downloader/

Rushikesh Tade
  • 473
  • 4
  • 7
1

wget generally works this way, but some sites may have problems and it may create too many unnecessary html files. In order to make this work easier and to prevent unnecessary file creation, I am sharing my getwebfolder script, which is the first Linux script I wrote for myself. This script downloads all the content of a web folder passed as a parameter.

When you try to download an open web folder with wget that contains more than one file, wget downloads a file named index.html. This file contains the file list of the web folder. My script converts the file names written in the index.html file into web addresses and downloads them cleanly with wget.

Tested on Ubuntu 18.04 and Kali Linux; it may work on other distros as well.

Usage :

  • extract the getwebfolder file from the zip file provided below

  • chmod +x getwebfolder (only needed the first time)

  • ./getwebfolder webfolder_URL

such as ./getwebfolder http://example.com/example_folder/

Download Link

Details on blog
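
For reference, here is a minimal sketch of the same idea in plain shell. It is not the author's script, just an illustration that assumes an Apache-style index.html whose links are plain relative file names:

#!/bin/sh
# Illustrative sketch, not the original getwebfolder script.
# Fetch the folder's index.html, extract the relative href targets,
# and download each one with wget.
url="$1"
wget -q -O - "$url" \
  | grep -o 'href="[^"]*"' \
  | sed 's/^href="//; s/"$//' \
  | grep -v -e '^?' -e '^/' -e '^\.\./' \
  | while IFS= read -r name; do
      wget "${url}${name}"   # assumes the URL ends with a trailing slash
    done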