
As the title says. E.g. do:

wget "https://www.veracrypt.fr/en/Downloads.html" --local-encoding=utf-8 --remote-encoding=utf-8 -O - | less

Note that the <a href> elements in the page contain &#43;download, not +download as expected. Feeding that encoded URL back into wget (or curl) causes the download to fail.

How can I work around this? To be clear, the aim is to wget/curl the contents of a page, grep a download link out of it, and wget the asset that link points at.
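
Roughly, the pipeline I have in mind looks like this (the grep pattern is just illustrative); the final wget fails because the extracted link still contains the literal &#43;:

# sketch of the intended fetch-grep-fetch pipeline
url=$(wget -qO- "https://www.veracrypt.fr/en/Downloads.html" | grep -oE 'https://launchpad\.net/[^"]+' | head -n 1)
wget "$url"    # fails: $url still contains &#43; instead of +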

basher

1 Answer


That's what's literally in the page source:

<a href="https://launchpad.net/veracrypt/trunk/1.24-update6/&#43;download/VeraCrypt%20Setup%201.24-Update6.exe">

So wget is just giving you what it got. Remember that within an element attribute you can escape characters using HTML entity escaping; &#43; is simply the numeric character reference for +. This is valid HTML, and a compliant browser will decode it before using the URL.

You can do the same with any HTML entity decoder. Unless your fetching tool can decode these for you, you'll have to decode them yourself before re-fetching.
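
For example (just one possible approach; here Python's html.unescape does the decoding, but Perl's HTML::Entities or even a targeted sed 's/&#43;/+/g' would work equally well, and the URL and grep pattern below are only illustrative):

# sketch: extract the first Launchpad link, decode its entities, then fetch it
url=$(wget -qO- "https://www.veracrypt.fr/en/Downloads.html" \
  | grep -oE 'https://launchpad\.net/[^"]+' \
  | head -n 1 \
  | python3 -c 'import html, sys; print(html.unescape(sys.stdin.read().strip()))')
wget "$url"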

tadman
  • That explains it. I searched the wget & curl man pages but can't figure out how to decode it automatically. Surely one of them should be capable of it? – basher Jul 09 '20 at 16:29
  • Why would they? Their job is to fetch content, not parse HTML and interpret stuff like this. If you want an HTML parser you'll need another tool. Most of those can pull out the attributes in their decoded form. `wget` is already complex enough. – tadman Jul 09 '20 at 16:30
  • Cheers, I suppose I've become too complacent with it. But your point is valid, they're not one-size-fits-all tools. Will decode the URLs w/ `html2text` and all's good. Thanks! – basher Jul 09 '20 at 16:36
  • There's a long history of UNIX-type tools doing one thing and doing it well, so you'll often have to combine several into a solution. Not everything is a Swiss Army chainsaw like Perl. – tadman Jul 09 '20 at 17:14