
Is it possible to use the wget command on Linux to get all of the files in a directory tree of a website?

I can recursively grab an entire website with --mirror and such, but I would like to just get all of the files in a single directory. In my mind, it would look something like:

    wget http://www.somesite.com/here/is/some/folders/*

This would download ALL the files in the /folders/ directory (it doesn't have to recurse into subdirectories). But the wildcard character doesn't seem to work with wget, so I am looking for the correct way to do it.
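
For reference, something like the following does work for mirroring the whole site (the host is just the example from above), but it pulls down far more than the one folder I'm after:

    wget --mirror http://www.somesite.com/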

Matt Hintzke

1 Answer


Sure, there's wget -r, which will recurse down everything under folders/, provided there's an index to recurse through.
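
For example, something along these lines should do it, assuming the server actually serves up an auto-generated index page for that directory (untested sketch; -nd keeps wget from recreating the directory tree locally, and -l 1 stops it from descending into subfolders):

    wget -r -nd -l 1 http://www.somesite.com/here/is/some/folders/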

The other thing you can do, if there's an index.htm or whatever in the folders directory, is grep/sed/cut your way through it, chaining one wget into another, like so:

    wget -qO - http://foo/folder/index.htm | sed 's/href=/#/' | cut -d\# -f2 | \
      while read url; do wget "$url"; done

which is generally what I do when I need to scrape and can't recurse for whatever reason.

edit:

You probably want to add --no-parent and set --domains properly. The wget manpage is actually pretty good and covers this stuff.
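
In other words, something like this rough sketch, reusing the example URL from the question (--no-parent stops the recursion from wandering up the tree, and -D/--domains keeps it from following links off to other hosts):

    wget -r -nd -l 1 --no-parent --domains=www.somesite.com http://www.somesite.com/here/is/some/folders/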

jane arc
    couldn't get your sed | cut to work cleanly for my case, ended up doing something like `wget -O - http://foo | sed -n 's#^.*href="\([^"]\{1,\}\)".*$#\1#p' | while read url; ...` – zamnuts Jan 16 '14 at 01:29