
I'm retrieving a large amount of data via wget, using the following command:

wget --save-cookies ~/.urs_cookies --load-cookies ~/.urs_cookies --keep-session-cookies --content-disposition -i links.dat

My problem is that links.dat contains thousands of links. The files are relatively small (~100 kB), so each one takes about 0.2 s to download but about 5 s of waiting for the HTTP response. It ends up taking 14 h to download the whole dataset, with most of the time spent waiting for requests. A typical transfer looks like this:

URL transformed to HTTPS due to an HSTS policy
--2017-02-15 18:01:37--  https://goldsmr4.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2FMERRA2%2FM2I1NXASM.5.12.4%2F1980%2F01%2FMERRA2_100.inst1_2d_asm_Nx.19800102.nc4&FORMAT=bmM0Lw&BBOX=43%2C1.5%2C45%2C3.5&LABEL=MERRA2_100.inst1_2d_asm_Nx.19800102.SUB.nc4&FLAGS=&SHORTNAME=M2I1NXASM&SERVICE=SUBSET_MERRA2&LAYERS=&VERSION=1.02&VARIABLES=t10m%2Ct2m%2Cu50m%2Cv50m
Connecting to goldsmr4.gesdisc.eosdis.nasa.gov (goldsmr4.gesdisc.eosdis.nasa.gov)|198.118.197.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50223 (49K) [application/octet-stream]
Saving to: ‘MERRA2_100.inst1_2d_asm_Nx.19800102.SUB.nc4.1’

This might be a noob question, but it seems counterproductive that it works this way. I have very little knowledge of what is happening behind the scenes, but I wanted to make sure I'm not doing anything wrong and that the process can indeed be made faster.

If it helps, I'm downloading MERRA-2 data for specific nodes.

Thanks!

Miguel

1 Answer


Wget will re-use an existing connection for multiple requests to the same server, potentially saving you the time required to establish and tear down the socket.

You can take advantage of this by providing multiple URLs on a single command line. For example, to download them in batches of 100:

#!/usr/bin/env bash

# Options shared by every wget invocation.
wget_opts=(
  --save-cookies ~/.urs_cookies
  --load-cookies ~/.urs_cookies
  --keep-session-cookies
  --content-disposition
)

# Collect URLs into batches of 100 and fetch each batch with a single
# wget process, so the connection to the server gets reused.
manyurls=()
while IFS= read -r url; do
  manyurls+=( "$url" )
  if [ "${#manyurls[@]}" -eq 100 ]; then
    wget "${wget_opts[@]}" "${manyurls[@]}"
    manyurls=()
  fi
done < links.dat

# Fetch whatever is left over (fewer than 100 URLs).
if [ "${#manyurls[@]}" -gt 0 ]; then
  wget "${wget_opts[@]}" "${manyurls[@]}"
fi

Note that I haven't tested this, so it may need tweaking. If it doesn't work, tell me what error you get and I'll try to debug it.
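If you want to verify that the connection really is being reused, wget's log normally contains a line like "Reusing existing connection to host:port." each time it skips a fresh handshake. Here is a rough, untested sketch that fetches the first few URLs in one run and counts those messages; the exact message text may vary between wget versions, and the filenames are just the ones used above:

# Fetch the first five URLs from links.dat in a single wget run,
# logging to wget.log, then count how many connections were reused.
head -n 5 links.dat | xargs wget --load-cookies ~/.urs_cookies --keep-session-cookies --content-disposition -o wget.log
grep -c 'Reusing existing connection' wget.log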

So ... that's "connection re-use" or "keepalive". The other thing that would speed up your download is HTTP pipelining, which basically allows a second request to be sent before the first response has been received. wget does not support this; curl supports it in its library (libcurl), but not in the command-line tool.

I don't have a ready-made tool to suggest that supports HTTP pipelining. (Besides which, tool recommendations are off-topic.) You can see how pipelining works in this SO answer. If you feel like writing something in a language of your choice that supports libcurl, I'm sure any difficulties you come across would make for an interesting follow-up Stack Overflow question.

ghoti
  • Thank you very much, I'll try this out. Basically, what you provided is a way to put 100 URLs at the end of the command? Like running "wget --options url1 url2 ... url100"? – Miguel Feb 16 '17 at 22:14
  • Yes, that's exactly it. I figured 100 was a reasonable number -- you probably can't put them ALL there because the command line would be too long. Both wget and curl will recycle a connection that a server leaves open. You'll have no difficulty finding additional documentation about keepalive headers if that appears to be an issue. Another thing to consider is that you can run multiple `wget`s at the same time. But figuring out how to background stuff or managing a pool of backgrounded shell functions is out of scope for this question. – ghoti Feb 16 '17 at 23:04
  • I do think you'd get better performance using pipelining, but you'd need to write your own software to handle it. Combine pipelining with a pool of backgrounded wgets or curls, and you'll get the benefits of everything! Until nasa.gov blocks your IP for hitting their servers too hard, of course. – ghoti Feb 16 '17 at 23:05
  • Hi, thank you once again for the help. Unfortunately, in a quick test (just putting 3 URLs by hand at the end of the command line), no time saving was observed: it still takes 5 s between download completions, with only 10% of that time spent on the download itself. – Miguel Feb 17 '17 at 17:51
  • @Miguel, I can't replicate your results. When I make multiple requests with wget, one is followed immediately by the next. Do you get the same results using `curl -O url -O url -O url` ? I ran: `time eval curl -O\ http://www.google.com/index.html#{1..100}` (100 requests) and it completed in 15 seconds. `wget http://www.google.com/index.html#{1..100}` took 18 seconds. – ghoti Feb 17 '17 at 18:56
  • Is it possible that it's the *response* that is delayed 5 seconds? If so, you're back to HTTP pipelining, or running a slew of background processes to fetch things in parallel. – ghoti Feb 17 '17 at 19:39
  • I believe the issue is something else. Each link points to a daily .nc4 file and selects certain variables, outputting an .nc4 with only those variables. There is also a .netrc file to authenticate before downloading the data. I'm guessing that most of the time is spent on querying and authenticating. What I'm trying now is using OPeNDAP constraints to perform a query over the whole "folder": instead of one query/authentication per day, I will run fewer queries of the form "I want from date X to date Y, with variables Z". Hopefully this will speed up the process. – Miguel Feb 18 '17 at 04:02
  • However, this will return an ASCII file, which kind of bugs me, since the .nc4 format is much neater for what I intend. – Miguel Feb 18 '17 at 04:03
  • Okay, so if we assume that every GET will include this 5s delay *from the server*, then the only control you have is via pipelining or running multiple requests simultaneously. If you're not planning to write something that takes advantage of pipelining, have a look at [this answer](http://stackoverflow.com/a/1685440/1072112) which might help you control a collection of backgrounded `wget` processes (a rough sketch of that approach follows these comments). – ghoti Feb 18 '17 at 14:51
  • That is the solution. I made a quick test running multiple shells with wget and I found no delays between them. Running multiple requests should considerably speed up the whole process. Thank you very very much for all the help! :) – Miguel Feb 21 '17 at 16:06
  • You're very welcome! Glad we could find a solution in comments, even if we determined it was impossible to remove this server delay from the client. :-) – ghoti Feb 21 '17 at 18:05
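For completeness, here is a rough, untested sketch of the "multiple wgets at once" approach that the comments above settled on. It splits links.dat into batches of 100 URLs and runs up to 4 batched wget processes at a time using xargs -P; the batch size and parallelism are arbitrary choices, and because several processes writing the same --save-cookies file at once could race, the cookie jar is only read here, not saved:

# Hand wget 100 URLs per invocation, running up to 4 invocations in
# parallel. Each wget still reuses its connection within its batch;
# the parallelism hides the per-request wait on the server side.
xargs -n 100 -P 4 wget \
  --load-cookies ~/.urs_cookies \
  --keep-session-cookies \
  --content-disposition \
  < links.dat

As noted in the comments, keep the parallelism modest; hitting the server too hard may get your IP blocked.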