Python, extract urls from xml sitemap that contain a certain word

Question

I'm trying to extract all URLs from a sitemap that contain the word foo. I've managed to extract all the URLs but can't figure out how to keep only the ones I want. In the example below, only the URLs for apples and pears should be returned.

<url>
<loc>
https://www.example.com/p-1224-apples-foo-09897.php
</loc>
<lastmod>2018-05-29</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>
https://www.example.com/p-1433-pears-foo-00077.php
</loc>
<lastmod>2018-05-29</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>
https://www.example.com/p-3411-oranges-ping-66554.php
</loc>
<lastmod>2018-05-29</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>

score 2 · Accepted Answer · answered Sep 30 '18 at 12:52

I made the XML well-formed by wrapping it in a root element (<urls> and </urls>) and saved it as src.xml:

<urls>
<url>
<loc>
https://www.example.com/p-1224-apples-foo-09897.php
</loc>
<lastmod>2018-05-29</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>
https://www.example.com/p-1433-pears-foo-00077.php
</loc>
<lastmod>2018-05-29</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>
https://www.example.com/p-3411-oranges-ping-66554.php
</loc>
<lastmod>2018-05-29</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
</urls>

Then use xml.etree.ElementTree to parse the XML and print only the URLs that contain foo:

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('src.xml')
>>> root = tree.getroot()
>>> for url in root.findall('url'):
...     for loc in url.findall('loc'):
...         if 'foo' in loc.text:
...             print(loc.text.strip())  # strip the newlines around the URL
...

https://www.example.com/p-1224-apples-foo-09897.php
https://www.example.com/p-1433-pears-foo-00077.php
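The same idea as a standalone script, in case the sitemap is a real one rather than the trimmed snippet above. This is only a sketch: it assumes the file follows the sitemaps.org protocol, where the root element is urlset and every tag lives in the namespace shown in NS. The function name urls_containing and the inline SITEMAP string are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org-style sitemaps (assumption: the real file
# declares it on its root <urlset>; the snippet in the question omits it).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical inline sitemap standing in for the real file.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/p-1224-apples-foo-09897.php</loc></url>
  <url><loc>https://www.example.com/p-1433-pears-foo-00077.php</loc></url>
  <url><loc>https://www.example.com/p-3411-oranges-ping-66554.php</loc></url>
</urlset>"""


def urls_containing(xml_text, word):
    """Return the <loc> URLs in a sitemap string that contain `word`."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)  # namespaced lookup
        if loc.text and word in loc.text
    ]


for url in urls_containing(SITEMAP, "foo"):
    print(url)
```

Without the NS mapping, findall('url') would find nothing in a namespaced sitemap, which is a common reason for getting an empty result.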

score 1 · Answer 2 · answered Sep 30 '18 at 10:49

Assuming the URLs are always inside loc elements, you can use an XPath expression:

//loc[contains(text(),'foo')]

A more generic version, matching any element whose text contains foo, would be:

//*[contains(text(),'foo')]

This requires lxml, which supports XPath functions such as contains(); the XPath subset in xml.etree.ElementTree does not.
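A rough sketch of how the loc expression might be run from Python. It assumes lxml is installed and uses its xpath() method; if lxml is missing, the sketch falls back to the standard library and filters in Python instead, which is a different technique with the same result. The SAMPLE string and the foo_urls helper are invented for the example.

```python
import xml.etree.ElementTree as ET

try:
    from lxml import etree  # third-party; supports XPath 1.0 functions
except ImportError:
    etree = None  # fall back to the standard library below

# Hypothetical sample, shaped like the XML in the question.
SAMPLE = """<urls>
<url><loc>
https://www.example.com/p-1224-apples-foo-09897.php
</loc></url>
<url><loc>
https://www.example.com/p-3411-oranges-ping-66554.php
</loc></url>
</urls>"""


def foo_urls(xml_text, word="foo"):
    """Return the text of every <loc> whose content contains `word`."""
    if etree is not None:
        root = etree.fromstring(xml_text)
        # The word is passed as an XPath variable rather than formatted in.
        nodes = root.xpath("//loc[contains(text(), $word)]", word=word)
    else:
        # ElementTree's XPath subset has no contains(), so filter in Python.
        root = ET.fromstring(xml_text)
        nodes = [n for n in root.iter("loc") if word in (n.text or "")]
    return [n.text.strip() for n in nodes]


print(foo_urls(SAMPLE))
```

Note that a namespaced sitemap would need prefixed names in the expression (and a namespaces mapping), since //loc only matches elements in no namespace.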

QHarr

score 1 · Answer 3 · answered Sep 30 '18 at 13:25

If you already have all the URLs, you can check whether the word "foo" is in each one using the in operator. Something like this (assuming the URLs are in a list called urls):

urls = [url for url in urls if 'foo' in url]
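For completeness, here is that filter applied to the three URLs from the question. It is just an illustration: the filter_urls helper is a made-up name, and the strip() call is an extra step in case the strings still carry the newlines that surround each URL in the XML.

```python
def filter_urls(urls, word):
    """Return only the URLs that contain `word`, with whitespace trimmed."""
    return [url.strip() for url in urls if word in url]


# The three URLs from the question, as they might come out of the parser.
urls = [
    "\nhttps://www.example.com/p-1224-apples-foo-09897.php\n",
    "\nhttps://www.example.com/p-1433-pears-foo-00077.php\n",
    "\nhttps://www.example.com/p-3411-oranges-ping-66554.php\n",
]

for url in filter_urls(urls, "foo"):
    print(url)
```

The in test is a plain substring match, so it would also match foo inside a longer word such as food.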

teller.py3

score 0 · Answer 4 · answered Feb 22 '21 at 10:16

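import xml.dom.minidom

xml_file = r'your_file.xml'  # path to the sitemap file
dom_tree = xml.dom.minidom.parse(xml_file)
root_node = dom_tree.documentElement
print(root_node.nodeName)  # name of the root element

# Collect every <loc> element, wherever it sits in the tree
loc_nodes = root_node.getElementsByTagName("loc")
for loc in loc_nodes:
    url = loc.childNodes[0].data.strip()  # text inside <loc>, trimmed
    if 'foo' in url:  # keep only the URLs containing the word
        print(url)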
