Crawling sitemap.xml via Python

Question

I am crawling a sitemap.xml file, and my objective is to find all the URLs and print a running count alongside each one.

Below is the structure of the XML:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.htcysnc.com/m/designer-sarees</loc>
        <lastmod>2014-09-01</lastmod>
        <changefreq>hourly</changefreq>
        <priority>0.9</priority>
    </url>
    <url>
        <loc>http://www.htcysnc.com/m/anarkali-suits</loc>
        <lastmod>2014-09-01</lastmod>
        <changefreq>hourly</changefreq>
        <priority>0.9</priority>
    </url>

Below is my code

from BeautifulSoup import BeautifulSoup
import requests
import gzip
from StringIO import StringIO


def crawler():
    count = 0
    url = "http://www.htcysnc.com/sitemap/sitemap_product.xml.gz"
    old_xml = requests.get(url)
    # The sitemap is gzip-compressed, so decompress the response body first
    new_xml = gzip.GzipFile(fileobj=StringIO(old_xml.content)).read()
    #new_xml=old_xml.text
    final_xml = BeautifulSoup(new_xml)
    # Collect every <loc> tag in the document
    item_to_be_found = final_xml.findAll('loc')
    for i in item_to_be_found:
        count = count + 1
        print i
        print count


# Call the function at module level, not from inside its own body
crawler()

My output is like this

<loc>http://www.htcysnc.com/elegant-yellow-green-suit-seven-east-p63703</loc>
1
<loc>http://www.htcysnc.com/elegant-orange-pink-printed-suit-seven-east-p63705</loc>
2

I need the output as plain links, without the <loc> and </loc> tags. I have tried using replace, but that throws an error.

Sorry, the post got messed up. Have edited it now. – Tushar Bakaya Jul 07 '15 at 06:25

score 5 · Accepted Answer · edited May 23 '17 at 11:53

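Every item in the item_to_be_found list is a Tag object, so you can get the string inside each <loc> tag with .text or .string. The two attributes differ in general (.string is None when a tag has more than one child, while .text concatenates all nested text), but they give the same result here because each <loc> contains a single string.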

# enumerate avoids a list search per item and numbers duplicate URLs correctly
for count, loc in enumerate(item_to_be_found, 1):
    print count, loc.text

This will give you a result like:

1 http://www.htcysnc.com/m/designer-sarees
2 http://www.htcysnc.com/m/anarkali-suits
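
For readers on Python 3, where the BeautifulSoup 3 import and the print statement above no longer work, the same idea can be sketched with the standard library alone. This is a minimal sketch, not the method used in this answer: it swaps BeautifulSoup for xml.etree.ElementTree, and the names SITEMAP_NS, extract_locs, crawl, and sample_xml are hypothetical. The sample bytes stand in for the downloaded file and contain only the two entries shown in the question.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace declared on <urlset> in the sitemap shown in the question.
# ElementTree prefixes every tag name with it in braces.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(xml_bytes):
    """Return the text of every <loc> element, in document order."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def crawl(gz_bytes):
    """Decompress a gzipped sitemap and print a running count with each URL."""
    urls = extract_locs(gzip.decompress(gz_bytes))
    for count, url in enumerate(urls, start=1):
        print(count, url)
    return urls


# Stand-in for the downloaded file: the two entries from the question, gzipped.
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://www.htcysnc.com/m/designer-sarees</loc></url>
    <url><loc>http://www.htcysnc.com/m/anarkali-suits</loc></url>
</urlset>"""
urls = crawl(gzip.compress(sample_xml))
```

In real use, the argument to crawl would be the raw response body, for example requests.get(url).content as in the question, assuming the server returns the file still gzip-compressed.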

answered Jul 07 '15 at 06:25

salmanwahed

Why do I keep getting "loc undefined"? – halo09876 Jan 31 '17 at 08:32

score 0 · Answer 2 · answered Jul 16 '18 at 17:18

Instead of printing the whole tag inside your loop, you can print its text attribute, which keeps the code simple. Calling .strip() also removes any whitespace around the URL.

print i.text.strip()

That should give you the necessary information without any tags.
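
To illustrate why the .strip() call can matter, here is a small Python 3 sketch under some assumptions: it uses the standard library's xml.etree.ElementTree in place of BeautifulSoup, and sample_xml is a contrived sitemap whose first <loc> is padded with newlines and spaces. The names NS, sample_xml, links, and total are hypothetical. The total count comes from len rather than a hand-maintained counter.

```python
import xml.etree.ElementTree as ET

# Map a short prefix to the sitemap namespace so paths stay readable.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Contrived input: the first <loc> has whitespace around the URL.
sample_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>
            http://www.htcysnc.com/m/designer-sarees
        </loc>
    </url>
    <url>
        <loc>http://www.htcysnc.com/m/anarkali-suits</loc>
    </url>
</urlset>"""

root = ET.fromstring(sample_xml)

# .text returns the raw string inside the element; .strip() trims the padding.
links = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
total = len(links)

print(total)
for link in links:
    print(link)
```

Without .strip(), the first entry would keep its leading and trailing newlines and spaces, which would break a later request to that URL.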

answered Jul 16 '18 at 17:18

EcoEffect0
