While this question is older, Google sent me here anyway.
I ended up using xsltproc to parse the sitemap.xml:
sitemap-txt.xsl:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:sitemap="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" version="1.0" encoding="UTF-8" indent="no"/>
  <xsl:template match="/">
    <xsl:for-each select="sitemap:urlset/sitemap:url">
      <xsl:value-of select="sitemap:loc"/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
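To sanity-check the stylesheet against a local copy of the sitemap before wiring it into the pipeline (the input filename sitemap.xml here is just an example):
xsltproc sitemap-txt.xsl sitemap.xml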
Using it (in this case as part of a cache-prewarming script, so the retrieved pages are not kept ("-o /dev/null") and only some statistics are printed ("-w ....")):
curl -sS http://example.com/sitemap.xml | xsltproc sitemap-txt.xsl - | xargs -n1 -r -P4 curl -sS -o /dev/null -w "%{http_code}\t%{time_total}\t%{url_effective}\n"
(Rewriting this to use wget instead of curl is left as an exercise for the reader ;-) )
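If you really want a wget starting point, something like this should work (a rough sketch only; wget has no equivalent of curl's "-w" statistics output, so the per-request timings are lost):
curl -sS http://example.com/sitemap.xml | xsltproc sitemap-txt.xsl - | xargs -n1 -r -P4 wget -q -O /dev/null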
What this does is:
- Retrieve sitemap.xml
- Parse the sitemap and output the URL list as plain text (one URL per line)
- Use xargs to call curl on each URL, running 4 requests in parallel (a split variant of the pipeline is sketched below)
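If you want to keep the extracted URL list around (for example to re-run the warm-up without fetching and parsing the sitemap again), the same pipeline can be split into two steps; urls.txt is just an assumed filename:
curl -sS http://example.com/sitemap.xml | xsltproc sitemap-txt.xsl - > urls.txt
xargs -n1 -r -P4 curl -sS -o /dev/null -w "%{http_code}\t%{time_total}\t%{url_effective}\n" < urls.txt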