
I am trying to crawl all links in a sitemap.xml to re-cache a website, but the recursive option of wget does not work. I only get this as a response:

Remote file exists but does not contain any link -- not retrieving.

But the sitemap.xml is definitely full of "http://..." links.

I tried almost every option of wget but nothing worked for me:

wget -r --mirror http://mysite.com/sitemap.xml

Does anyone know how to open all the links inside a website's sitemap.xml?

Thanks, Dominic

dohomi

3 Answers


It seems that wget can't parse XML. So, you'll have to extract the links manually. You could do something like this:

wget --quiet http://www.mysite.com/sitemap.xml --output-document - | egrep -o "https?://[^<]+" | wget -i -

I learned this trick here.
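
If it helps, the same one-liner can be split into separate steps so the extracted URL list can be inspected before crawling. This is only a sketch using the same example URL; sitemap.xml and urls.txt are just placeholder file names:

# fetch the sitemap to a local file
wget --quiet http://www.mysite.com/sitemap.xml --output-document sitemap.xml
# crudely pull out everything that looks like a link (good enough for simple sitemaps)
egrep -o "https?://[^<]+" sitemap.xml > urls.txt
# download every extracted URL
wget -i urls.txt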

  • How can I set each downloaded html filename to the title of the page? Right now everything is just index.html, index.html.1, index.html.2, etc. – user2028856 Feb 16 '15 at 09:39
  • Would like to see an elaboration on this that accounts for many children in the XML file structure... anyone? This is nice though! Thank you! – lcm Jul 18 '15 at 23:30
  • I have the sitemap in gz format; how should I visit the URLs from that? – lightsaber Oct 17 '16 at 10:04
  • Also, I want to get .amp pages, which are not listed in the sitemap. How do I create AMP URLs from these URLs? My AMP URLs use the .amp extension. – lightsaber Oct 17 '16 at 18:45

While this question is older, Google sent me here.

I finally used xsltproc to parse the sitemap.xml:

sitemap-txt.xsl:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
            xmlns:sitemap="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" version="1.0" encoding="UTF-8" indent="no"/>
    <xsl:template match="/">
        <xsl:for-each select="sitemap:urlset/sitemap:url">
            <xsl:value-of select="sitemap:loc"/><xsl:text>&#xa;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

Using it (in this case it is part of a cache-prewarming script, so the retrieved pages are not kept ("-o /dev/null") and only some statistics are printed ("-w ....")):

curl -sS http://example.com/sitemap.xml | xsltproc sitemap-txt.xsl - | xargs -n1 -r -P4 curl -sS -o /dev/null -w "%{http_code}\t%{time_total}\t%{url_effective}\n"

(Rewriting this to use wget instead of curl is left as an exercise for the reader ;-), though a rough sketch of one possible variant follows the list below.) What this does is:

  1. Retrieve sitemap.xml
  2. Parse the sitemap and output the URL list as text (one URL per line)
  3. Use xargs to call "curl" on each URL, running 4 requests in parallel
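
As a rough sketch of such a wget variant (assuming GNU xargs and the same hypothetical example.com sitemap), the following should warm the cache in the same way, discarding each page body with "-O /dev/null". Note that wget has no direct equivalent of curl's "-w" format string, so no per-URL statistics are printed:

curl -sS http://example.com/sitemap.xml | xsltproc sitemap-txt.xsl - | xargs -n1 -r -P4 wget -q -O /dev/null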

You can use one of the sitemapping tools. Try Slickplan. It has a site crawler option, and by using it you can import the structure of an existing website and create a visual sitemap from it. You can then export it to the Slickplan XML format, which contains not only links but also SEO metadata, page titles (product names), and a bunch of other helpful data.

Konrad