I want to parse an XML file from a website. Can anyone help me?

This is the XML, and I just want to extract the information from it.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>
http://www.habergazete.com/haber-detay/1/69364/cAYKUR-3-bin-500-personel-alimi-yapacagini-duyurdu-cAYKUR-3-bin-500-personel-alim-sarlari--2015-01-29.html
</loc>
<news:news>
<news:publication>
<news:name>Haber Gazete</news:name>
<news:language>tr</news:language>
</news:publication>
<news:publication_date>2015-01-29T15:04:01+02:00</news:publication_date>
<news:title>
ÇAYKUR 3 bin 500 personel alımı yapacağını duyurdu! (ÇAYKUR 3 bin 500 personel alım şarları)
</news:title>
</news:news>
<image:image>
<image:loc>
http://www.habergazete.com/resimler/haber/haber_detay/611x395-alim-54c8f335b176e-1422536816.jpg
</image:loc>
</image:image>
</url>

I tried this code to parse it, but it returns an empty result:

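from http import client
import xml.etree.ElementTree as ET

# Download the sitemap
conn = client.HTTPConnection("www.habergazete.com")
conn.request("GET", "/sitemaps/1/haberler.xml")
response = conn.getresponse()
xmlData = response.read()
conn.close()

# Parse it and look for the <loc> elements
root = ET.fromstring(xmlData)
print(root.findall("loc"))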

Any suggestions?

Thanks :)

ufuk.dogan

1 Answer

First, the XML you show is not well-formed, so parsing it should raise an exception -- it's missing the final closing '</urlset>'. I suspect you're just not showing us the actual XML you're trying to parse.
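A minimal sketch of that first point, assuming a shortened, hypothetical sample (`truncated`, with a made-up `example.html` URL) in place of the XML pasted in the question, and a hypothetical helper `try_parse`: ElementTree rejects the fragment until the root element is closed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the XML pasted in the question:
# the root <urlset> element is never closed.
truncated = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>http://www.habergazete.com/example.html</loc></url>'
)


def try_parse(text):
    """Return the parsed root element, or None if the text is not well-formed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


broken = try_parse(truncated)                    # fails: root tag never closed
repaired = try_parse(truncated + '</urlset>')    # parses once the root is closed
```

Comparing against `None` (rather than testing the element's truthiness) is deliberate, since an element's truth value depends on whether it has children.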

Once you fix that (e.g. by parsing xmlData + '</urlset>' if the XML data was actually truncated in some way), you're running into a namespace problem, which is easy to show:

>>> et.tostring(root)
b'<ns0:urlset xmlns:ns0="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:ns1="http://www.google.com/schemas/sitemap-news/0.9" xmlns:ns2="http://www.google.com/schemas/sitemap-image/1.1">\n<ns0:url>\n<ns0:loc>\nhttp://www.habergazete.com/haber-detay/1/69364/cAYKUR-3-bin-500-personel-alimi-yapacagini-duyurdu-cAYKUR-3-bin-500-personel-alim-sarlari--2015-01-29.html\n</ns0:loc>\n<ns1:news>\n<ns1:publication>\n<ns1:name>Haber Gazete</ns1:name>\n<ns1:language>tr</ns1:language>\n</ns1:publication>\n<ns1:publication_date>2015-01-29T15:04:01+02:00</ns1:publication_date>\n<ns1:title>\n&#199;AYKUR 3 bin 500 personel al&#305;m&#305; yapaca&#287;&#305;n&#305; duyurdu! (&#199;AYKUR 3 bin 500 personel al&#305;m &#351;arlar&#305;)\n</ns1:title>\n</ns1:news>\n<ns2:image>\n<ns2:loc>\nhttp://www.habergazete.com/resimler/haber/haber_detay/611x395-alim-54c8f335b176e-1422536816.jpg\n</ns2:loc>\n</ns2:image>\n</ns0:url></ns0:urlset>'

Yes, it's a very long string, but pretty early in it you'll see:

<ns0:loc>

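which shows that the loc you're looking for is actually in a namespace: ns0: is the prefix ElementTree generates for the document's default namespace, http://www.sitemaps.org/schemas/sitemap/0.9, so a search for a bare "loc" does not match it.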
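Another way to see the same thing, sketched here with a shortened, hypothetical sample (`sample`, with a made-up `example.html` URL), is to print each element's tag: ElementTree stores the namespace URI in braces in front of the local name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample with only the default sitemap namespace.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>http://www.habergazete.com/example.html</loc></url>'
    '</urlset>'
)

root = ET.fromstring(sample)

# iter() walks the root and all of its descendants in document order.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```

None of the printed tags is the bare string `loc`, which is why a search for that exact name comes back empty.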

Third, the docs at https://docs.python.org/2/library/xml.etree.elementtree.html carefully explain, and I quote:

Element.findall() finds only elements with a tag which are **direct children** of the current element.

My emphasis: you'd only find tags that are direct children of urlset, not generic descendants thereof (children of children, and so on down).
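To illustrate the direct-children rule on its own, here is a small sketch using a shortened, hypothetical sample (`sample`, with a made-up `example.html` URL) and the fully qualified tag names. The constant `NS` is just a convenience name introduced for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample: <loc> is a grandchild of the root.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>http://www.habergazete.com/example.html</loc></url>'
    '</urlset>'
)
NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

root = ET.fromstring(sample)

# <loc> is not a direct child of <urlset>, so this finds nothing.
direct = root.findall(NS + 'loc')

# <url> is a direct child, so this finds it.
children = root.findall(NS + 'url')

# Descending one level at a time reaches <loc> without any path syntax.
nested = [loc for url in children for loc in url.findall(NS + 'loc')]
```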

So, expanding the namespace, and using a little xpath syntax to search recursively:

>>> root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}loc')
[<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}loc' at 0x1022a50e8>]

...you do finally find the element you were looking for.
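If writing out the full namespace URI each time feels heavy, `findall()` and `findtext()` also accept a `namespaces` mapping from prefixes to URIs. The following is only a sketch under assumptions: the sample document is shortened, its URLs and headline are made up, and the `sm` prefix and the helper name `extract_entries` are arbitrary choices for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the same shape as the question's sitemap.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>
http://www.habergazete.com/example.html
</loc>
<news:news>
<news:title>
Example headline
</news:title>
</news:news>
<image:image>
<image:loc>
http://www.habergazete.com/example.jpg
</image:loc>
</image:image>
</url>
</urlset>"""

# Prefixes used in search paths; 'sm' is an arbitrary name for the
# document's default (sitemap) namespace.
NAMESPACES = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'news': 'http://www.google.com/schemas/sitemap-news/0.9',
    'image': 'http://www.google.com/schemas/sitemap-image/1.1',
}


def extract_entries(xml_text):
    """Return one dict per <url> element with its location, title and image."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall('sm:url', NAMESPACES):
        entries.append({
            # strip() removes the newlines surrounding the text in the source
            'loc': url.findtext('sm:loc', default='',
                                namespaces=NAMESPACES).strip(),
            'title': url.findtext('news:news/news:title', default='',
                                  namespaces=NAMESPACES).strip(),
            'image': url.findtext('image:image/image:loc', default='',
                                  namespaces=NAMESPACES).strip(),
        })
    return entries


entries = extract_entries(sample)
for entry in entries:
    print(entry['loc'], entry['title'], entry['image'])
```

Passing `default=''` keeps `strip()` safe for entries that lack a title or an image.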

BTW, some of us find BeautifulSoup, http://www.crummy.com/software/BeautifulSoup/bs4/doc/ , easier to use for XML parsing tasks when we don't need the extra speed from etree or lxml.

Alex Martelli
  • The XML continues like that: the next entry is another news item, and so on, which is why I didn't paste all of it. Even if I put the closing tag at the end of this XML, the result is the same. This is an RSS XML feed listing a lot of news, and I didn't want to paste the whole thing. – ufuk.dogan Jan 29 '15 at 15:10
  • @ufuk.dogan, fine, but you **should** have mentioned that little detail in your question's text. It's fine, indeed advisable, to snip examples down to the minimum needed to reproduce the problem, but **not**, without notice, down to being incorrect (e.g., malformed XML), because that puts extra load on answerers to notice, diagnose, and fix the problem. Anyway, I went on to show your namespace problem **and** your need for some `xpath` syntax to search recursively down the tree; fixing both problems, as well as the incorrect truncation, gave a working solution. – Alex Martelli Jan 29 '15 at 15:25
  • Thank you :) This code works well, this helps a lot :D – ufuk.dogan Jan 29 '15 at 15:38