I am currently struggling to mirror a website using Wget. After browsing the web, I came up with the following command to mirror a complete website:

wget --mirror --convert-links --adjust-extension --backup-converted --page-requisites -e robots=off http://www.example.com

As expected, after running the command there is a folder called www.example.com containing all downloaded files. However, some background images are missing. Digging through the files and logs, I found that Wget seems to have a problem with quoted image URLs.

The website uses the following CSS to include a background image:

<div ... style="background-image: url("/path/to/image") ;..." ... />

While collecting the page requisites, Wget parses the URL and tries to download the file

http://www.example.com/"/path/to/image"

which obviously fails with a 404 error:

--2018-01-08 18:04:00-- https://www.example.com/&quot;/path/to/image&quot;
Reusing existing connection to www.example.com:443.
HTTP request sent, awaiting response... 404 Not Found
2018-01-08 18:04:00 ERROR 404: Not Found
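
To illustrate what seems to go wrong, here is a minimal Python sketch. It assumes that the raw HTML contains `&quot;` entities inside the style attribute, as the log above suggests, and that these must be decoded before the CSS `url(...)` value is read. The names `CSS_URL` and `css_urls` are hypothetical helpers made up for this example; this is not how Wget works internally.

```python
import html
import re
from urllib.parse import urljoin

# Matches url(...) in CSS, with optional single or double quotes around the target.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")


def css_urls(style_attr, base):
    """Return absolute URLs referenced by url(...) in an inline style value."""
    # Decode HTML entities first, so &quot; becomes a real quote character
    # that the CSS pattern can recognise and strip.
    decoded = html.unescape(style_attr)
    return [urljoin(base, m.group(2)) for m in CSS_URL.finditer(decoded)]


# Assumed raw attribute value, based on the &quot; seen in the log.
raw_style = "background-image: url(&quot;/path/to/image&quot;) ;"
print(css_urls(raw_style, "http://www.example.com/"))
```

If the entity decoding step is skipped, the same pattern captures `&quot;/path/to/image&quot;` as the path, which matches the failing request in the log.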

Unfortunately I cannot post the original domain for privacy reasons...

I have already tried to find a solution on the web, but I could not find the right keywords to search for, so as a last resort I am asking here for help.

Is there a way to tell Wget to ignore quotes inside URLs?

Dharman
Rosso
  • Do you have some debug output that shows wget 404-ing on the quoted urls? If you could post that as well, it would help. – ron rothman Jan 08 '18 at 17:01
  • The CSS code is embedded as a style attribute in the HTML tag, so Wget is definitely parsing it – Rosso Jan 08 '18 at 17:02
  • I think you want `-k`. Did you see this? https://stackoverflow.com/questions/6348289/download-a-working-local-copy-of-a-webpage – JawguyChooser Jan 08 '18 at 17:29
  • According to the documentation, `-k` is just the short version of `--convert-links` – Rosso Jan 09 '18 at 07:33
  • As I was not able to resolve the problem, I moved to _httrack_, which does not seem to have this problem. As a consequence, I cannot provide an answer to my question, but I can suggest this alternative tool where applicable. – Rosso May 02 '18 at 09:54

0 Answers