How can I parse XML using Python?

Question

I want to parse XML from a website. Can anyone help me?

This is the XML, and I want to extract just the information in it.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>
http://www.habergazete.com/haber-detay/1/69364/cAYKUR-3-bin-500-personel-alimi-yapacagini-duyurdu-cAYKUR-3-bin-500-personel-alim-sarlari--2015-01-29.html
</loc>
<news:news>
<news:publication>
<news:name>Haber Gazete</news:name>
<news:language>tr</news:language>
</news:publication>
<news:publication_date>2015-01-29T15:04:01+02:00</news:publication_date>
<news:title>
ÇAYKUR 3 bin 500 personel alımı yapacağını duyurdu! (ÇAYKUR 3 bin 500 personel alım şarları)
</news:title>
</news:news>
<image:image>
<image:loc>
http://www.habergazete.com/resimler/haber/haber_detay/611x395-alim-54c8f335b176e-1422536816.jpg
</image:loc>
</image:image>
</url>

I tried this code to parse it, but it returns an empty result:

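from http import client
import xml.etree.ElementTree as ET

conn = client.HTTPConnection("www.habergazete.com")
conn.request("GET", "/sitemaps/1/haberler.xml")
response = conn.getresponse()
xmlData = response.read()  # raw bytes of the sitemap
conn.close()
root = ET.fromstring(xmlData)
print(root.findall("loc"))  # prints [] (the problem described above)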

Any suggestions?

Thanks :)

Try `print(xmlData)` first? Make sure you get the data. — laike9m, Jan 29 '15 at 14:55
I am sure; I can get all the data :) — ufuk.dogan, Jan 29 '15 at 15:12

Alex Martelli · Accepted Answer · 2015-01-29T15:21:05.627

First, the XML you show is not well-formed, so parsing it should raise an exception -- it's missing the final closing '</urlset>'. I suspect you're just not showing us the actual XML you're trying to parse.
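As a minimal sketch of that first point (the sample string and the helper name `try_parse` are made up here, not taken from the question), ElementTree rejects a document whose root element is never closed, and accepts it once the closing tag is appended:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sitemap with the final </urlset> missing.
truncated = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://www.example.com/</loc></url>"
)

def try_parse(text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as exc:
        print("not well-formed:", exc)
        return None

broken = try_parse(truncated)               # ParseError is caught, so this is None
fixed = try_parse(truncated + "</urlset>")  # parses once the root is closed
print(broken is None, fixed is not None)
```

So if the real download parses without an exception, the data was not actually truncated.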

Second, once you fix that (e.g., by parsing xmlData + b'</urlset>' if the XML data was actually truncated in some way), you run into a namespace problem, which is easy to show:

>>> ET.tostring(root)
b'<ns0:urlset xmlns:ns0="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:ns1="http://www.google.com/schemas/sitemap-news/0.9" xmlns:ns2="http://www.google.com/schemas/sitemap-image/1.1">\n<ns0:url>\n<ns0:loc>\nhttp://www.habergazete.com/haber-detay/1/69364/cAYKUR-3-bin-500-personel-alimi-yapacagini-duyurdu-cAYKUR-3-bin-500-personel-alim-sarlari--2015-01-29.html\n</ns0:loc>\n<ns1:news>\n<ns1:publication>\n<ns1:name>Haber Gazete</ns1:name>\n<ns1:language>tr</ns1:language>\n</ns1:publication>\n<ns1:publication_date>2015-01-29T15:04:01+02:00</ns1:publication_date>\n<ns1:title>\n&#199;AYKUR 3 bin 500 personel al&#305;m&#305; yapaca&#287;&#305;n&#305; duyurdu! (&#199;AYKUR 3 bin 500 personel al&#305;m &#351;arlar&#305;)\n</ns1:title>\n</ns1:news>\n<ns2:image>\n<ns2:loc>\nhttp://www.habergazete.com/resimler/haber/haber_detay/611x395-alim-54c8f335b176e-1422536816.jpg\n</ns2:loc>\n</ns2:image>\n</ns0:url></ns0:urlset>'

Yes, it's a very long string, but pretty early in it you'll see:

<ns0:loc>

which shows the loc you're looking for is actually in a namespace: ns0: is just the prefix ElementTree generates on output for the document's default namespace, http://www.sitemaps.org/schemas/sitemap/0.9.
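Another way to see the same thing, sketched on a made-up one-entry sitemap: ElementTree stores each tag as `{namespace-uri}localname`, so a bare `loc` never matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that declares the same default namespace as the question's XML.
doc = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://www.example.com/</loc></url>"
    "</urlset>"
)
root = ET.fromstring(doc)

# Every element inherits the default namespace, and it becomes part of the tag.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)

# Even a recursive search fails when the namespace is left out of the name.
bare_matches = root.findall(".//loc")
print(bare_matches)
```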

Third, the docs at https://docs.python.org/2/library/xml.etree.elementtree.html carefully explain, and I quote:

Element.findall() finds only elements with a tag which are **direct children** of the current element.

My emphasis: you'd only find tags that are direct children of urlset, not generic descendants thereof (children of children, and so on down).
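To isolate that point from the namespace issue, here is a small sketch on a made-up document with no namespaces at all:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-level document: <loc> is a grandchild of the root, not a child.
doc = "<urlset><url><loc>http://www.example.com/a</loc></url></urlset>"
root = ET.fromstring(doc)

direct = root.findall("loc")     # searches direct children of <urlset> only
nested = root.findall(".//loc")  # ".//" searches all descendants

print(len(direct), len(nested))  # 0 1
```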

So, expanding the namespace, and using a little xpath syntax to search recursively:

>>> root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}loc')
[<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}loc' at 0x1022a50e8>]

...you do finally find the element you were looking for.
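If writing out the full `{uri}` form gets tedious, `findall`, `find`, and `findtext` also accept a prefix-to-URI mapping. A sketch, assuming a trimmed stand-in for the question's sitemap; the sample URL and title, the prefix `sm`, and the helper name `extract_entries` are all placeholders chosen for illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the sitemap in the question; URL and title are placeholders.
XML_DATA = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
<url>
<loc>
http://www.example.com/story.html
</loc>
<news:news>
<news:title>
Sample title
</news:title>
</news:news>
</url>
</urlset>"""

# The prefixes are local names; only the URIs have to match the document.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

def extract_entries(xml_text):
    """Return a (loc, title) pair for each <url> element."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        # The text is surrounded by newlines in the source, so strip it.
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        title = url.findtext("news:news/news:title", default="", namespaces=NS).strip()
        entries.append((loc, title))
    return entries

print(extract_entries(XML_DATA))
```

Iterating over each `url` first keeps every `loc` paired with its own title, which a flat `.//` search over the whole tree would not guarantee.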

BTW, some of us find BeautifulSoup, http://www.crummy.com/software/BeautifulSoup/bs4/doc/ , easier to use for XML parsing tasks when we don't need the extra speed from etree or lxml.

The XML goes on like that: the next line starts another news item, and it continues the same way, which is why I didn't paste it all. If I put '</urlset>' at the end of this XML, the result is the same; you can imagine it like that. Because this is an RSS XML that lists a lot of news, I didn't want to paste the whole thing. — ufuk.dogan, Jan 29 '15 at 15:10
@ufuk.dogan, fine, but you **should** have noted that little detail in your Q's text -- it's fine, indeed advisable, to snip examples down to the minimum needed to reproduce the problem, but **not**, without notice, down to being incorrect (e.g., malformed XML), because it puts extra load on answerers to notice, diagnose, and fix the problem. Anyway, I continued by showing your namespace problem **and** your need for some `xpath` syntax to search recursively down the tree, and, fixing both problems as well as the incorrect truncation, showed a working solution. — Alex Martelli, Jan 29 '15 at 15:25
