I am trying to get the titles and links out of the atom_sample.xml file I have attached, using the same code that was working for other RSS feeds.

from lxml import etree
tree = etree.parse('atom_sample.xml')
root = tree.getroot()

titles = root.xpath('//entry/title/text()')
links = root.xpath('//entry/link/@href')
print(titles)
print(links)

Results: [] []

With the other RSS file, from the question "Issues with python 3.x multiline regex?", this was working flawlessly.

Yves

1 Answer

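I think the problem is that the elements in your XML file are in the XML namespace http://www.w3.org/2005/Atom, so the unprefixed names in your XPath (entry, title, link) do not match them. You can see the namespace on the root element: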

In [1]: from lxml import etree
...: tree = etree.parse('atom_sample.xml')
...: root = tree.getroot()


In [2]: root
Out[2]: <Element {http://www.w3.org/2005/Atom}feed at 0x7f198e8da808>

I am not sure how to get rid of this namespace easily, but you could try one of the answers to this question.
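One way to sidestep the namespace without removing it is to match on local-name() in the XPath. This is only a minimal sketch: the inline ATOM document below is a hypothetical stand-in for atom_sample.xml (which I don't have), so adjust it to your real file.

```python
from lxml import etree

# Hypothetical stand-in for atom_sample.xml (assumption: the real file
# has the same feed/entry/title/link layout).
ATOM = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry>
    <title>sample.00</title>
    <link href="https://myfeedurl.com/feed/00"/>
  </entry>
  <entry>
    <title>sample.01</title>
    <link href="https://myfeedurl.com/feed/01"/>
  </entry>
</feed>"""

root = etree.fromstring(ATOM)

# local-name() compares only the tag name and ignores the namespace part.
titles = root.xpath("//*[local-name()='entry']/*[local-name()='title']/text()")
links = root.xpath("//*[local-name()='entry']/*[local-name()='link']/@href")

print(titles)
print(links)
```

This keeps the query independent of the namespace URI, at the cost of a more verbose XPath that would also match same-named elements from any other namespace.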

Anyway, as a workaround I usually write each step of the XPath as <prefix>:<tag> and pass a dictionary that maps the prefix to the namespace URI through the namespaces parameter of the xpath method. For example:

In [4]: namespaces = {'atom':'http://www.w3.org/2005/Atom'}

In [5]: root.xpath('//atom:entry/atom:title/text()', namespaces=namespaces)
Out[5]: 
['sample.00',
 'sample.01',
 'sample.02',
 'sample.03',
 'sample.04',
 'sample.05',
 'sample.06',
 'sample.07',
 'sample.08',
 'sample.09',
 'sample.10']

In [6]: root.xpath('//atom:entry/atom:link/@href', namespaces=namespaces)
Out[6]: 
['https://myfeedurl.com/feed/00',
 'https://myfeedurl.com/feed/01',
 'https://myfeedurl.com/feed/02',
 'https://myfeedurl.com/feed/03',
 'https://myfeedurl.com/feed/04',
 'https://myfeedurl.com/feed/05',
 'https://myfeedurl.com/feed/06',
 'https://myfeedurl.com/feed/07',
 'https://myfeedurl.com/feed/08',
 'https://myfeedurl.com/feed/09',
 'https://myfeedurl.com/feed/10']
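
If you want each title paired with its link rather than two separate lists, the same namespaces dictionary can be reused per entry. This is a rough sketch under assumptions: SAMPLE is a made-up two-entry feed standing in for atom_sample.xml, and entry_pairs is a hypothetical helper name, not something lxml provides.

```python
from lxml import etree

# Prefix "atom" is arbitrary; only the URI has to match the document.
NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up stand-in for atom_sample.xml (assumption).
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>sample.00</title><link href="https://myfeedurl.com/feed/00"/></entry>
  <entry><title>sample.01</title><link href="https://myfeedurl.com/feed/01"/></entry>
</feed>"""


def entry_pairs(root):
    """Return a list of (title, href) tuples, one per Atom entry."""
    pairs = []
    for entry in root.xpath("//atom:entry", namespaces=NS):
        title = entry.findtext("atom:title", namespaces=NS)
        link = entry.find("atom:link", namespaces=NS)
        # An entry without a link element yields None for the href.
        href = link.get("href") if link is not None else None
        pairs.append((title, href))
    return pairs


root = etree.fromstring(SAMPLE)
print(root.xpath("//entry/title/text()"))  # unprefixed names match nothing here
print(entry_pairs(root))
```

Iterating per entry keeps a title and its link together even if some entry happens to lack one of them, which two independent xpath calls would not guarantee.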
running.t