
I use wget to fetch a list of links from a text file. An example link would be:

http://localhost:8888/data/test.php?value=ABC123456789

The PHP file returns a table with information, and the response is to be appended to another text file. As for the error, it is obvious that it currently cannot handle this number of URLs, because the character limit is exceeded. If I use only 2 URLs, it works perfectly fine.

The text file contains a total of 10 000 URLs. The command I am using is:

wget -i /Applications/MAMP/htdocs/data/URLs.txt -O - >> /Applications/MAMP/htdocs/data/append.txt

According to my research, a quick way to "fix" this is to change the LimitRequestLine directive, or to add it if it does not exist. Since I use MAMP (for macOS), what I did was:

Open /Applications/MAMP/conf/apache/httpd.conf

And insert the following under AccessFileName .htaccess:

LimitRequestLine 1000000000
LimitRequestFieldSize 1000000000

But I still get the same error. I don't know why this happens.

Might it be easier to use cURL? If so, what would be the equivalent command?

Ava Barbilla

1 Answer


Your 414: Request-URI Too Large error has nothing to do with the number of URLs, and no, using cURL wouldn't help.

The problem is that one (or more?) of your URLs is simply too long for the target server, causing the error.

You can probably identify the URL causing the error by running:

cat URLs.txt | awk '{print length, $0}' | sort -nr | head -1

(thanks to https://stackoverflow.com/a/1655488/1067003 for that command)
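
If you want to list every URL that exceeds the limit rather than just the longest one, a small variation of that command works too (a sketch; 8190 bytes is Apache's documented default for LimitRequestLine, so adjust the number to whatever your server actually allows):

awk 'length($0) > 8190 {print NR ": " length($0) " chars: " $0}' URLs.txt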

Another possible cause is that you're not properly line-terminating the URLs in URLs.txt, so some of the URLs (or all of them?) get concatenated. For the record, the terminating character is "\n", aka hex code 0A, not the "\r\n" that most Windows editors use; I'm not sure how wget would handle such malformed line terminators.
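
To check whether that is what is happening, and to fix it if so, something along these lines should do (a sketch; URLs_unix.txt is just an example name for the cleaned copy):

file URLs.txt
tr -d '\r' < URLs.txt > URLs_unix.txt

file reports something like "ASCII text, with CRLF line terminators" when the file uses \r\n endings, and tr -d '\r' strips the carriage returns so that only \n remains.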

Note that if you are downloading lots of .html files (or any other compressible files), cURL would be much faster than wget, because cURL supports compressed transfers via the --compressed argument (using gzip and deflate at the time of writing), while wget doesn't support compression at all, and HTML compresses very well (easily 5-6 times smaller than the uncompressed version with gzip).
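
As an illustration (a sketch, assuming one URL per line with Unix line endings and the paths from the question), a rough cURL equivalent of the wget command could be:

xargs -n 1 curl -s --compressed < /Applications/MAMP/htdocs/data/URLs.txt >> /Applications/MAMP/htdocs/data/append.txt

Here xargs feeds one URL to each curl invocation; raising -n (e.g. -n 100) hands several URLs to a single curl process and avoids some of the per-process startup cost.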

hanshenrik
  • Thanks @hanshenrik ! When I run the code in Terminal, it returns the last URL from the list with some numbers appended to it: `http://localhost:8888/data/test.php?value=ABC9999999995660005`. As you can see the last parameter should be `ABC999999999` where this `5660005` was added at the end. What should I do with this? – Ava Barbilla Jul 16 '17 at 23:14
  • Hi @hanshenrik . I changed the export format to **Windows Formatted Text (.txt)** which worked like a charm for me. Can this be sped up by opening simultaneous connections at the same time? Perhaps by using some sort of `xargs`? This would be the last part that would really help me out! – Ava Barbilla Jul 17 '17 at 00:07
  • @AvaBarbilla yes it can. maybe `cat URLs.txt | xargs --max-proc=10 $(which wget)` – hanshenrik Jul 19 '17 at 12:53
  • Where do I introduce my `wget` command? Here `$(which wget)`? When I run `cat /Applications/MAMP/htdocs/data/URLs.txt | xargs --max-proc=10 $(which wget -i /Applications/MAMP/htdocs/data/URLs.txt -O - >> /Applications/MAMP/htdocs/data/append.txt)` in Terminal I receive the following response: `xargs: illegal option -- -`. I'm a newbie in this matter. Could you perhaps briefly explain what each does and how to implement my `wget` command? Thanks for everything! – Ava Barbilla Jul 19 '17 at 15:03
  • @AvaBarbilla maybe `cat /Applications/MAMP/htdocs/data/URLs.txt | xargs --max-proc=10 $(which wget) -i - -O - >> /Applications/MAMP/htdocs/data/append.txt` - but note that, if the target server supports compressed transfers, curl would be faster here – hanshenrik Jul 19 '17 at 16:37
  • When I run your command I still get: `xargs: illegal option -- -`. It also says `usage: xargs [-P maxprocs]` so I changed `--max-proc=10` to `--maxprocs=10` which gave me the same error. Eventually I just used `-P=10` which returned `xargs: max. processes must be >0`. Evidently, 10 is bigger than 0. Why does it not work? @hanshenrik – Ava Barbilla Jul 19 '17 at 18:17
  • So I managed to fix it using `-P 10` instead of `-P=10`, however, it does not parse the links properly. To every URL this `%0D` is added at the end. I tried changing the format again and if I do then I receive `xargs: insufficient space for argument`. What do you think? @hanshenrik – Ava Barbilla Jul 19 '17 at 18:32
  • I managed to solve it using `| tr -d '\r'` which strips any carriage returns thanks to this [answer](https://stackoverflow.com/questions/20185095/remove-0d-from-variable-in-bash) – Ava Barbilla Jul 19 '17 at 19:24
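
Putting the pieces from this thread together, the final pipeline could look roughly like this (a sketch based on the commands discussed above; with -P 10 ten downloads run in parallel, so their responses may interleave in append.txt):

tr -d '\r' < /Applications/MAMP/htdocs/data/URLs.txt | xargs -n 1 -P 10 wget -q -O - >> /Applications/MAMP/htdocs/data/append.txt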