I use the following command to append the response body for each URL in a list to a single output file:
wget -i /Applications/MAMP/htdocs/data/urls.txt -O - \
>> /Applications/MAMP/htdocs/data/export.txt
This works fine, and when it finishes it says:
Total wall clock time: 1h 49m 32s
Downloaded: 9999 files, 3.5M in 0.3s (28.5 MB/s)
In order to speed this up I used:
cat /Applications/MAMP/htdocs/data/urls.txt | \
tr -d '\r' | \
xargs -P 10 $(which wget) -i - -O - \
>> /Applications/MAMP/htdocs/data/export.txt
This opens simultaneous connections, making it a little faster:
Total wall clock time: 1h 40m 10s
Downloaded: 3943 files, 8.5M in 0.3s (28.5 MB/s)
As you can see, it somehow omits more than half of the files and takes roughly the same time to finish, and I cannot figure out why. What I want is to download 10 files at once (parallel processing) using xargs, moving on to the next URL as soon as one has finished writing to STDOUT. Am I missing something, or can this be done another way?
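For reference, this is the variant I was about to try next: it passes one URL to each wget process instead of using -i -, but I am not sure it is correct, and I don't know whether the parallel writes to the same output file can interleave:

# one URL per wget process, at most 10 running at a time
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | \
xargs -n 1 -P 10 wget -q -O - \
>> /Applications/MAMP/htdocs/data/export.txt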
On another note, is there a limit on how many parallel connections can be set? It would really help to know how many connections my machine can handle without slowing the system down too much, or even risking some kind of system failure.
My API rate limits are as follows:
Number of requests per minute: 100
Number of mapping jobs in a single request: 100
Total number of mapping jobs per minute: 10,000
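The only idea I had for staying under the 100-requests-per-minute limit is to split the URL list into batches of 100 and sleep between batches. This is just a rough, untested sketch (the batch_ prefix is a name I made up), so I would welcome a better approach:

cd /Applications/MAMP/htdocs/data
tr -d '\r' < urls.txt | split -l 100 - batch_
for f in batch_*; do
    # up to 10 parallel wget processes per batch of 100 URLs
    xargs -n 1 -P 10 wget -q -O - < "$f" >> export.txt
    sleep 60    # crude pacing to stay under 100 requests per minute
done
rm batch_*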