I'm not a computer science major so any insight would be really
appreciated.
The first important thing is to understand what the host is. According to RFC 1738, the parts of a URL are
//<user>:<password>@<host>:<port>/<url-path>
The user, password, @, port, and url-path are optional, so for the URL
https://pubs.broadinstitute.org/diabimmune/data/9
the host is
pubs.broadinstitute.org
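To see which part of a given URL is the host, you can strip the other components with a short shell sketch (a minimal illustration of the RFC 1738 layout above, not a full URL parser; real-world URLs can be messier):

```shell
# Extract the host from a URL: drop the scheme, an optional
# user:password@ prefix, and everything from the first ':' or '/'
# onward. A minimal sketch, not a complete RFC 1738 parser.
url="https://pubs.broadinstitute.org/diabimmune/data/9"
host=$(printf '%s\n' "$url" | sed -E 's#^[a-z]+://##; s#^[^@/]*@##; s#[:/].*$##')
echo "$host"   # pubs.broadinstitute.org
```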
You can get information about a link using the --spider
option of GNU wget:
wget --spider https://pubs.broadinstitute.org/diabimmune/data/9
The important parts of its output are
HTTP request sent, awaiting response... 302 Found
Location: https://diabimmune.broadinstitute.org/diabimmune/data/9 [following]
...
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://diabimmune.broadinstitute.org/diabimmune/data/9/ [following]
...
HTTP request sent, awaiting response... 200 OK
Length: 298054 (291K) [text/html]
Remote file exists
Status codes 301 and 302 are redirects; observe that the host is now different: diabimmune.broadinstitute.org.
However, if you download that page, it contains links pointing back to the pubs.broadinstitute.org
host, so GNU wget
does not retrieve them, because by default recursive retrieval does not span hosts.
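You can pull the redirect target's host straight out of such spider output; a sketch operating on a saved Location line (the header text here is copied from the output above):

```shell
# Given a "Location:" line from wget --spider output, extract the
# host the redirect points to (same sed idea as plain URL parsing).
line='Location: https://diabimmune.broadinstitute.org/diabimmune/data/9 [following]'
printf '%s\n' "$line" | sed -E 's#^Location: [a-z]+://([^:/ ]+).*#\1#'
# prints: diabimmune.broadinstitute.org
```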
As to how you can avoid this in the future, and the worry that the -rH parameter can wreak
havoc if improperly used:
It can indeed have the effect of downloading files you do not actually need, which can clog your network connection or fill your disk space. The Spanning Hosts section of the wget manual suggests combining -rH
with -D
to restrict retrieval to only the domains you actually want. In your case that might be
wget -rH -Dbroadinstitute.org -np -nd https://pubs.broadinstitute.org/diabimmune/data/9/
Observe that broadinstitute.org
here means: retrieve if the domain ends with broadinstitute.org,
which is true for both pubs.broadinstitute.org
and diabimmune.broadinstitute.org.
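That acceptance rule can be mimicked with a simple suffix check; this sketch illustrates the "ends with broadinstitute.org" logic (simplified: unlike wget it does not insist on a dot boundary before the suffix):

```shell
# Mimic the -D suffix test: accept a host if its domain ends in
# broadinstitute.org. A simplified sketch of the rule described
# above, not wget's actual matching code.
matches_domain() {
  case "$1" in
    *broadinstitute.org) echo yes ;;
    *) echo no ;;
  esac
}
matches_domain pubs.broadinstitute.org        # yes
matches_domain diabimmune.broadinstitute.org  # yes
matches_domain example.com                    # no
```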