I am struggling to mirror a website using Wget. After browsing the web, I came up with the following command to mirror a complete website:
wget --mirror --convert-links --adjust-extension --backup-converted --page-requisites -e robots=off http://www.example.com
As expected, after running the command there is a folder called www.example.com containing all downloaded files. However, some background images are missing. Digging through the files and logs, I found that Wget seems to have a problem with quoted image URLs.
The website uses an inline style attribute like the following to include a background image:
<div ... style="background-image: url("/path/to/image") ;..." ... />
While collecting the page requisites, Wget parses the URL, keeps the quotes as part of it, and tries to download the file
http://www.example.com/"/path/to/image"
which obviously fails with a 404 error:
--2018-01-08 18:04:00-- https://www.example.com/"/path/to/image"
Reusing existing connection to www.example.com:443.
HTTP request sent, awaiting response... 404 Not Found
2018-01-08 18:04:00 ERROR 404: Not Found
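To make the symptom concrete, here is a minimal Python sketch of what I suspect is happening. It is not Wget's code, and I have not checked how Wget actually parses this; the function names (`extract_raw`, `strip_quotes`, `extract_unquoted`) are hypothetical and the regular expression is a simplification. It only shows that resolving the `url(...)` argument with the quotes left in gives the broken address from the log, while stripping one pair of surrounding quotes first gives the address I would expect.

```python
import re
from urllib.parse import urljoin

# Simplified pattern: captures whatever sits between "url(" and ")",
# including any surrounding quotes. Real CSS parsing is more involved.
CSS_URL = re.compile(r"url\(\s*(.*?)\s*\)")


def extract_raw(style):
    """Return url(...) arguments exactly as written, quotes included."""
    return CSS_URL.findall(style)


def strip_quotes(value):
    """Remove one pair of matching single or double quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]
    return value


def extract_unquoted(style):
    """Return url(...) arguments with the surrounding quotes removed."""
    return [strip_quotes(u) for u in extract_raw(style)]


# Placeholder values taken from the question, not the real site.
base = "https://www.example.com/"
style = 'background-image: url("/path/to/image") ;'

# Quotes kept: the argument is treated as a relative path starting with '"'.
print([urljoin(base, u) for u in extract_raw(style)])

# Quotes stripped: the argument resolves as an absolute path on the host.
print([urljoin(base, u) for u in extract_unquoted(style)])
```

The first list contains the same kind of address that shows up in my log, the second one the address I would expect Wget to request.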
Unfortunately, I cannot post the original domain for privacy reasons.
I already tried to find a solution on the web, but I could not find the right keywords to search for, so as a last resort I am asking here.
Is there a way to tell Wget to ignore quotes inside URLs?