12

I have a sitemap at http://www.site.co.uk/sitemap.xml which is structured like this:

<sitemapindex>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
...

And I want to extract data from it. First of all I need to count how many <sitemap> elements are in the XML and then, for each of them, extract the <loc> and <lastmod> data. Is there an easy way to do this in Python?

I've seen other questions like this, but all of them extract, for example, every <loc> element inside the XML; I need to extract the data individually from each <sitemap> element.

I've tried to use lxml with this code:

import urllib2
from lxml import etree

u = urllib2.urlopen('http://www.site.co.uk/sitemap.xml')
doc = etree.parse(u)

element_list = doc.findall('sitemap')

for element in element_list:
    url = element.findtext('loc')
    print url

but element_list is empty.

Hyperion
  • A good StackOverflow question shows what you've tried already, and how it's failing. (I wholeheartedly agree with Anand that `lxml` is the right tool for the job; if you try it and have trouble, *then* you'll have cause to ask a question here). – Charles Duffy Jul 07 '15 at 18:04
  • Could also use https://docs.python.org/2/library/xml.etree.elementtree.html, no? – tandy Jul 07 '15 at 18:05
  • @tandy, sure -- it's built-in, but on the other hand, doesn't have real XPath. I tend to ignore it for the latter reason. – Charles Duffy Jul 07 '15 at 18:06
  • lxml doesn't work; can anybody help me understand why? – Hyperion Jul 07 '15 at 18:41

8 Answers

17

I chose to use the Requests and BeautifulSoup libraries. I created a dictionary where the key is the URL and the value is the last-modified date.

from bs4 import BeautifulSoup
import requests

xmlDict = {}

r = requests.get("http://www.site.co.uk/sitemap.xml")
xml = r.text

soup = BeautifulSoup(xml, "lxml")  # pass a parser explicitly (see the comments below)
sitemapTags = soup.find_all("sitemap")

print("The number of sitemaps is {0}".format(len(sitemapTags)))

for sitemap in sitemapTags:
    xmlDict[sitemap.find_next("loc").text] = sitemap.find_next("lastmod").text

print(xmlDict)

Or with lxml:

from lxml import etree
import requests

xmlDict = {}

r = requests.get("http://www.site.co.uk/sitemap.xml")
root = etree.fromstring(r.content)
print "The number of sitemap tags are {0}".format(len(root))
for sitemap in root:
    children = sitemap.getchildren()
    xmlDict[children[0].text] = children[1].text
print xmlDict
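
If relying on child order feels fragile, a namespace-aware sketch may be safer. It also shows why the question's plain `doc.findall('sitemap')` came back empty: sitemap files declare http://www.sitemaps.org/schemas/sitemap/0.9 as their default namespace, so unprefixed names match nothing.

from lxml import etree
import requests

# The default namespace declared by sitemap files
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

r = requests.get("http://www.site.co.uk/sitemap.xml")
root = etree.fromstring(r.content)

sitemaps = root.findall("sm:sitemap", namespaces=NS)
print("The number of sitemap tags is {0}".format(len(sitemaps)))
for sitemap in sitemaps:
    # findtext returns None for missing children; strip() removes the
    # whitespace around the URL visible in the question's XML
    loc = (sitemap.findtext("sm:loc", namespaces=NS) or "").strip()
    lastmod = sitemap.findtext("sm:lastmod", namespaces=NS)
    print(loc, lastmod)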
heinst
  • An HTML parser for XML? I mean, it works, but it's going to be needlessly permissive. – Charles Duffy Jul 07 '15 at 18:06
  • @CharlesDuffy Updated my answer...I never used lxml before so it took me a little bit – heinst Jul 07 '15 at 18:39
  • BeautifulSoup says that since the parser is not specified it uses the lxml parser by default; changing `soup = BeautifulSoup(xml)` to `soup = BeautifulSoup(xml, 'lxml')` works perfectly! – Hyperion Jul 07 '15 at 19:20
  • @Hyperion it probably changed since you wrote that command because as of today the default parser used by BeautifulSoup is `html.parser`. – bfontaine May 10 '18 at 17:15
7

Using Python 3, requests, pandas, xmltodict and a list comprehension:

import requests
import pandas as pd
import xmltodict

url = "https://www.gov.uk/sitemap.xml"
res = requests.get(url)
raw = xmltodict.parse(res.text)

data = [[r["loc"], r["lastmod"]] for r in raw["sitemapindex"]["sitemap"]]
print("Number of sitemaps:", len(data))
df = pd.DataFrame(data, columns=["links", "lastmod"])
print(df.head())

Output:

    links                                       lastmod
0   https://www.gov.uk/sitemaps/sitemap_1.xml   2018-11-06T01:10:02+00:00
1   https://www.gov.uk/sitemaps/sitemap_2.xml   2018-11-06T01:10:02+00:00
2   https://www.gov.uk/sitemaps/sitemap_3.xml   2018-11-06T01:10:02+00:00
3   https://www.gov.uk/sitemaps/sitemap_4.xml   2018-11-06T01:10:02+00:00
4   https://www.gov.uk/sitemaps/sitemap_5.xml   2018-11-06T01:10:02+00:00
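
One caveat to hedge against with xmltodict: if the index ever contains just a single <sitemap> entry, raw["sitemapindex"]["sitemap"] comes back as a dict rather than a list, and the comprehension would iterate over its keys. The force_list argument avoids that:

# Always parse <sitemap> as a list, even when only one is present
raw = xmltodict.parse(res.text, force_list=("sitemap",))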
petezurich
4

This function will extract all the URLs from the XML:

from bs4 import BeautifulSoup
import requests

def get_urls_of_xml(xml_url):
    r = requests.get(xml_url)
    xml = r.text
    soup = BeautifulSoup(xml, "lxml")  # pass a parser explicitly to avoid the warning

    links_arr = []
    for link in soup.find_all('loc'):
        linkstr = link.get_text(strip=True)
        links_arr.append(linkstr)

    return links_arr



links_data_arr = get_urls_of_xml("https://www.gov.uk/sitemap.xml")
print(links_data_arr)
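
Since the question also wants the <lastmod> paired with each <loc>, a small variant of the same function (a sketch; the "xml" feature assumes lxml is installed) could collect both per <sitemap> tag:

def get_sitemap_entries(xml_url):
    r = requests.get(xml_url)
    soup = BeautifulSoup(r.text, "xml")  # real XML parsing; requires lxml

    entries = []
    for sitemap in soup.find_all("sitemap"):
        loc = sitemap.find("loc").get_text(strip=True)
        lastmod_tag = sitemap.find("lastmod")
        lastmod = lastmod_tag.get_text(strip=True) if lastmod_tag else None
        entries.append((loc, lastmod))
    return entries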

JOH
2

Here is how to use BeautifulSoup to get the sitemap count and extract the text:

from bs4 import BeautifulSoup as bs

html = """
 <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
"""

soup = bs(html, "html.parser")
sitemap_count = len(soup.find_all('sitemap'))
print("sitemap count: %d" % sitemap)
print(soup.get_text())

Output:

sitemap count: 2

    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml

2015-07-07

    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml

2015-07-07
Bun
1

You can use advertools, which has a special function for parsing XML sitemaps. It can also parse gzipped sitemaps (.xml.gz) by default, and if you give it a sitemap index file it recursively fetches all of them into one DataFrame.


import advertools as adv

economist =  adv.sitemap_to_df('https://www.economist.com/sitemap-2022-Q1.xml')
economist.head()

Output (every row here also shares the same sitemap=https://www.economist.com/sitemap-2022-Q1.xml, etag=e2637d17284eefef7d1eafb9ef4ebe3a, sitemap_last_modified=2022-01-22 04:00:54+00:00, sitemap_size_mb=0.0865097 and download_date=2022-01-23 00:01:41.026416+00:00 columns):

    loc                                                                                                           lastmod                    changefreq  priority
0   https://www.economist.com/printedition/2022-01-22                                                             2022-01-20 15:57:17+00:00  daily       0.6
1   https://www.economist.com/the-world-this-week/2022/01/22/kals-cartoon                                         2022-01-20 16:53:34+00:00  daily       0.6
2   https://www.economist.com/united-states/2022/01/22/a-new-barbie-doll-commemorates-a-19th-century-suffragist   2022-01-20 16:10:36+00:00  daily       0.6
3   https://www.economist.com/britain/2022/01/22/tory-mps-love-to-hate-the-bbc-but-tory-voters-love-to-watch-it   2022-01-20 17:09:59+00:00  daily       0.6
4   https://www.economist.com/china/2022/01/22/the-communist-party-revisits-its-egalitarian-roots                 2022-01-20 16:48:14+00:00  daily       0.6
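
Applied to a sitemap index like the one in the question (hypothetical URL), the same one-liner should recurse into every child sitemap and return one combined DataFrame:

site = adv.sitemap_to_df('http://www.site.co.uk/sitemap.xml')
print(site['sitemap'].nunique(), 'child sitemaps,', len(site), 'URLs in total')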
Elias Dabbas
  • A bit overkill to use advertools just for this... – Laurent Jul 06 '22 at 04:30
  • Why? It's a single line of code. It works recursively by default. Handles zipped and regular sitemaps. Also includes other tags and info, like lastmod, etag, sitemap size. – Elias Dabbas Jul 08 '22 at 19:50
0

Here is a good library: https://github.com/mediacloud/ultimate-sitemap-parser.

Website sitemap parser for Python 3.5+.

Installation:

pip install ultimate-sitemap-parser

Example of extracting all pages of the site nytimes.com from sitemaps:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.nytimes.com/")
for page in tree.all_pages():
    print(page)
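
To get the count the question asks for, the generator can simply be materialized (a sketch; each page object also exposes attributes such as page.url):

pages = list(tree.all_pages())
print("Number of pages:", len(pages))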
Rufat
0

Using proper libraries with modern Python 3, requests and lxml, and honouring the utf-8 encoding from the XML declaration:

import requests
from lxml import etree
from pprint import pprint

session = requests.session()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
}

res = session.get('https://example.org/sitemap-xml', headers=headers)
xml_bytes = res.text.encode('utf-8')

# Parse the XML bytes
root = etree.fromstring(xml_bytes)

# Define the namespace
ns = {'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

urls = root.xpath('//sitemap:url[./sitemap:loc[contains(., "/en-us/")]]', namespaces=ns)

# List comprehension
urls = [u.xpath('./sitemap:loc/text()', namespaces=ns)[0] for u in urls]

pprint(urls)
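
The namespaces=ns mapping is the detail the question's original findall('sitemap') was missing. Reusing root and ns from above, a sketch of the same approach adapted to a sitemap *index* like the OP's:

# <sitemapindex> contains <sitemap> entries instead of <url> entries
sitemaps = root.xpath('//sitemap:sitemap', namespaces=ns)
print('Number of sitemaps:', len(sitemaps))
for sm in sitemaps:
    loc = sm.xpath('./sitemap:loc/text()', namespaces=ns)[0].strip()
    lastmod = sm.xpath('./sitemap:lastmod/text()', namespaces=ns)
    print(loc, lastmod[0].strip() if lastmod else None)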
Gilles Quénot
-1

I just had the same quest today. I used requests and re (regular expressions):

import requests
import re

sitemap_url = "https://www.gov.uk/sitemap.xml"
#if you need to send some headers
headers = {'user-agent': 'myApp'}
response = requests.get(sitemap_url,headers = headers)
xml = response.text

list_of_urls = []

for address in re.findall(r"https://.*(?=/</)", xml):
    list_of_urls.append(address + '/')  # I add a trailing slash; you might want to skip it
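
The lookahead above assumes every URL ends with a trailing slash right before its closing tag. A slightly more forgiving sketch captures whatever sits between the <loc> tags instead:

# re.DOTALL lets the match span the newlines inside <loc>...</loc>
list_of_urls = [m.strip() for m in re.findall(r"<loc>(.*?)</loc>", xml, re.DOTALL)]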
wickedpanda