
I was trying to download a nice dataset from this URL.

https://pubs.broadinstitute.org/diabimmune/data/9

using the command listed at https://diabimmune.broadinstitute.org/diabimmune/t1d-cohort/resources/16s-sequence-data

Alas, it was only returning the index file.

I'm not one to give up, and after several hours of searching I came across this post: wget downloads only one index.html file instead of other some 500 html files

Its solution worked wonders (so far; it's still downloading, haha).

I don't understand why this works, though. The link used as an example in that post is dead, so I'm not sure what the domain issue is, or how I can avoid it in future. I gather the -rH parameter can wreak havoc if improperly used.

I'm not a computer science major so any insight would be really appreciated.

Cheers.

wget -r -np -nd https://pubs.broadinstitute.org/diabimmune/data/9/ -H

Andy

1 Answer


I'm not a computer science major so any insight would be really appreciated.

The first important thing is to understand what the host is. According to RFC 1738, the parts of a URL are

//<user>:<password>@<host>:<port>/<url-path>

where user, password, @, port, and path are optional. So if you have the URL

https://pubs.broadinstitute.org/diabimmune/data/9

then the host is

pubs.broadinstitute.org
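The split described above can be sketched with plain shell parameter expansion; this is a minimal illustration of the RFC 1738 anatomy, not what wget does internally:

```shell
# Extract the <host> part of a URL, per the //<user>:<password>@<host>:<port>/<url-path> layout
url="https://pubs.broadinstitute.org/diabimmune/data/9"
host="${url#*://}"   # strip the scheme, e.g. "https://"
host="${host%%/*}"   # strip the <url-path>
host="${host##*@}"   # strip an optional <user>:<password>@ prefix
host="${host%%:*}"   # strip an optional :<port> suffix
echo "$host"         # prints: pubs.broadinstitute.org
```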

You can get information about a link using the --spider option of GNU Wget:

wget --spider https://pubs.broadinstitute.org/diabimmune/data/9

This will produce output; the important parts are:

HTTP request sent, awaiting response... 302 Found
Location: https://diabimmune.broadinstitute.org/diabimmune/data/9 [following]
...
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://diabimmune.broadinstitute.org/diabimmune/data/9/ [following]
...
HTTP request sent, awaiting response... 200 OK
Length: 298054 (291K) [text/html]
Remote file exists

Status codes 301 and 302 are redirects. Observe that the host is now different (diabimmune.broadinstitute.org). However, the page you end up downloading contains links whose host is pubs.broadinstitute.org, and since that is a different host, GNU Wget does not retrieve them by default.

how I can avoid this in future, I gather the -rH parameter can wreak havoc if improperly used.

The main risk is downloading files you do not actually need, which can clog your network connection or eat your disk space. The Spanning Hosts section of the Wget manual suggests combining -rH with -D to restrict retrieval to the domains you actually want; in your case that might be

wget -rH -Dbroadinstitute.org -np -nd https://pubs.broadinstitute.org/diabimmune/data/9/

Note that broadinstitute.org here means "retrieve if the host ends with broadinstitute.org", which is true for both pubs.broadinstitute.org and diabimmune.broadinstitute.org.
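That suffix test can be sketched in shell. This is a simplified model of the acceptance check (the real Wget -D logic lives in its host-matching code and may handle edge cases differently):

```shell
# Simplified model of wget's -D domain acceptance: accept a host
# if it ends with one of the listed domain suffixes.
matches_domain() {
  # $1 = candidate host, $2 = accepted domain suffix
  case "$1" in
    *"$2") return 0 ;;  # host ends with the accepted domain
    *)     return 1 ;;
  esac
}

matches_domain pubs.broadinstitute.org       broadinstitute.org && echo accepted
matches_domain diabimmune.broadinstitute.org broadinstitute.org && echo accepted
matches_domain example.com                   broadinstitute.org || echo rejected
```

So both hosts seen in the redirect chain pass the -D broadinstitute.org filter, while unrelated hosts are skipped.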

Daweo
  • Hey man, really appreciate the response here! Very insightful. Thanks for clarifying! :) – Andy Jun 10 '23 at 16:59