
I use the following command to fetch each URL in a list and append the responses to a single output file:

wget -i /Applications/MAMP/htdocs/data/urls.txt -O - \
     >> /Applications/MAMP/htdocs/data/export.txt

This works fine, and when it finishes it reports:

Total wall clock time: 1h 49m 32s
Downloaded: 9999 files, 3.5M in 0.3s (28.5 MB/s)

In order to speed this up I used:

cat /Applications/MAMP/htdocs/data/urls.txt | \
   tr -d '\r' | \
   xargs -P 10 $(which wget) -i - -O - \
   >> /Applications/MAMP/htdocs/data/export.txt

This opens simultaneous connections, making it a little faster:

Total wall clock time: 1h 40m 10s
Downloaded: 3943 files, 8.5M in 0.3s (28.5 MB/s)

As you can see, it somehow omits more than half of the files and takes roughly the same time to finish, and I cannot work out why. What I want is to download 10 files at once (parallel processing) using xargs, moving on to the next URL as soon as the current one has finished writing to stdout. Am I missing something, or can this be done another way?

On another note, is there a limit to the number of connections that can be set? It would really help to know how many connections my machine can handle without slowing the system down too much, or even risking some kind of system failure.
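For reference, the only concrete limit I know how to check is the per-process open-file limit, which caps how many sockets a single process can hold open (just a sketch):

# shows the maximum number of open file descriptors (and therefore sockets) per process
ulimit -n

But that does not tell me how many parallel downloads are actually sensible for my CPU or network.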

My API Rate-Limiting is as follows:

  • Number of requests per minute: 100
  • Number of mapping jobs in a single request: 100
  • Total number of mapping jobs per minute: 10,000
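For what it's worth, here is a rough, untested sketch of how I could throttle the requests to stay under that 100-per-minute limit, assuming GNU Parallel and its --delay option (which pauses between job starts):

# --delay 0.6 starts at most one new job roughly every 0.6 s (about 100 per minute);
# -j 10 caps the number of simultaneous downloads at 10.
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | \
    parallel --delay 0.6 -j 10 wget -O - {} \
    >> /Applications/MAMP/htdocs/data/export.txt

I have not tested this against the limit, so corrections are welcome.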

Ava Barbilla

2 Answers


Have you tried GNU Parallel? It will be something like this:

parallel -a /Applications/MAMP/htdocs/data/urls.txt wget -O - > result.txt

You can use this to see what it will do without actually doing anything:

parallel --dry-run ...

And either of these to see progress:

parallel --progress ...
parallel --bar ...

As your input file seems to be a bit of a mess, you can strip carriage returns like this:

tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | parallel wget {} -O - > result.txt
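If you want to see exactly what that will run against your file without downloading anything, you can combine it with --dry-run (a quick sketch, using your paths):

# prints the wget command for each URL instead of executing it
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | parallel --dry-run wget {} -O -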
Mark Setchell
  • Have upvoted, because it achieves OP's aims, but it would be nice to explain why the original approach was broken. – slim Jul 20 '17 at 14:50
  • When I run your command @MarkSetchell it starts parsing the URLs adding `%0D` at the end of each line. This is why I use `| tr -d '\r' |` in order to strip any carriage returns. When I apply `| tr -d '\r' |` to your command I receive the following response: `wget: missing URL` and `: No such file or directory:8888/data/test.php?isin= ABC123456789`. How can I properly remove all of the `%0D`? – Ava Barbilla Jul 20 '17 at 17:08
  • Please have another look. – Mark Setchell Jul 20 '17 at 17:49
  • It works fine @MarkSetchell. I think the **API** restriction prevents faster parsing because I achieve the same result. It starts fast at the beginning and then at some point it stops downloading. When I access the link directly with a parameter I successfully extracted on a previous run, the browser returns an empty response, which should not be the case. Note that I do this whilst the Terminal is downloading at the same time. This also worked for me: `cat /Applications/MAMP/htdocs/data/urls.txt | tr -d '\r' | parallel -j 8 wget {} -O - > /Applications/MAMP/htdocs/data/export.txt`. Thanks a lot! – Ava Barbilla Jul 20 '17 at 20:17
1

A few things:

  • I don't think you need the tr, unless there's something weird about your input file. xargs expects one item per line.
  • man xargs advises you to "Use the -n option with -P; otherwise chances are that only one exec will be done."
  • You are using wget -i -, which tells wget to read URLs from stdin. But xargs will be supplying the URLs as parameters to wget.
  • To debug, substitute echo for wget and check how it batches the parameters (see the sketch below).
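For example (a quick sketch using echo, which just prints each batch of arguments instead of fetching anything):

# each output line is one batch of up to 100 URLs that a single wget invocation would receive
cat urls.txt | xargs --max-args=100 echo

If each line shows a sensible group of URLs, the same invocation with wget in place of echo will be handed the same batches.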

So this should work:

 cat urls.txt | \
 xargs --max-procs=10 --max-args=100 wget --output-document=- 

(I've preferred the long options here: --max-procs is -P, and --max-args is -n.)

See wget download with multiple simultaneous connections for alternative ways of doing the same thing, including GNU parallel and some dedicated multi-threading HTTP clients.

However, in most circumstances I would not expect parallelising to significantly increase your download rate.

In a typical use case, the bottleneck is likely to be your network link to the server. During a single-threaded download, you would expect to saturate the slowest link in that route. You may get very slight gains with two threads, because one thread can be downloading while the other is sending requests. But this will be a marginal gain.

So this approach is only likely to be worthwhile if you're fetching from multiple servers, and the slowest link in the route to some servers is not at the client end.

slim
  • Hi @slim. Actually I'm parsing a list of identical links, which means I am targeting the same file with different sets of parameters. An example would be http://localhost:8888/data/test.php?value=ABC123456789. See my previous [question](https://stackoverflow.com/questions/45012726/wget-error-414-request-uri-too-large) – Ava Barbilla Jul 20 '17 at 16:17
  • That doesn't change anything about my answer. – slim Jul 20 '17 at 16:28
  • I said that due to your statement **fetching from multiple servers**. I run your command `cat /Applications/MAMP/htdocs/data/urls.txt | tr -d '\r' | xargs -P 10 -n 100 wget -O - >> /Applications/MAMP/htdocs/data/export.txt` where `tr` is used to strip any carriage returns in this case being `%0D`. It runs the URLs but still skips some of them. I know this because my first successful fetch includes all the 9999 files and the URLs that start appearing "empty" are not supposed to be empty. I check these in my database. I updated my question with some API details. – Ava Barbilla Jul 20 '17 at 17:17