
Is it possible to use the wget command on Linux to get all of the files in a directory tree of a website?

I can recursively grab an entire website with --mirror and such, but I would like to just get all of the files in a single directory. In my mind, it would look something like:

    wget http://www.somesite.com/here/is/some/folders/*

This would download ALL the files in the /folders/ directory (it doesn't have to recurse into subdirectories). But the wildcard character doesn't seem to work with wget, so I am looking for the correct way to do it.
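
For reference, something like the following does work for mirroring the whole site (the host is just the example from above), but it pulls down far more than the one folder I'm after:

    wget --mirror http://www.somesite.com/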

Matt Hintzke

1 Answer


Sure, there's wget -r, which will recurse down everything under folders/, provided there's an index to recurse through.
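
For example, something along these lines should do it, assuming the server actually serves up an auto-generated index page for that directory (untested sketch; -nd keeps wget from recreating the directory tree locally, and -l 1 stops it from descending into subfolders):

    wget -r -nd -l 1 http://www.somesite.com/here/is/some/folders/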

The other thing you can do, if there's an index.htm or whatever in the folders directory, is grep/sed/cut your way through it, chaining one wget into another, like so:

    wget -qO - http://foo/folder/index.htm | sed 's/href=/#/' | cut -d\# -f2 | \
      while read url; do wget "$url"; done

which is generally what I do when I need to scrape and can't recurse for whatever reason.

edit:

You probably want to add --no-parent and set --domains properly. The wget manpage is actually pretty good and covers this stuff.
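
In other words, something like this rough sketch, reusing the example URL from the question (--no-parent stops the recursion from wandering up the tree, and -D/--domains keeps it from following links off to other hosts):

    wget -r -nd -l 1 --no-parent --domains=www.somesite.com http://www.somesite.com/here/is/some/folders/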

jane arc
    couldn't get your sed | cut to work cleanly for my case, ended up doing something like `wget -O - http://foo | sed -n 's#^.*href="\([^"]\{1,\}\)".*$#\1#p' | while read url; ...` – zamnuts Jan 16 '14 at 01:29