Use lxml find element to parse Arxiv XML from API

Question

I'm fetching XML from the arXiv API and want to parse all of the entries with the lxml library. Here is my code to fetch the XML for 100 articles.

import urllib.request  # Python 3; on Python 2 use "import urllib" instead
from lxml import etree

start_index = 0
results_per_iteration = 100
base_url = 'http://export.arxiv.org/api/query?'
search_query = 'cat:cs.CV+OR+cat:cs.LG+OR+cat:cs.CL+OR+cat:cs.NE+OR+cat:stat.ML'
query = 'search_query=%s&sortBy=lastUpdatedDate&start=%i&max_results=%i'\
    % (search_query, start_index, results_per_iteration)  # max_results is a count, not an end index

response = urllib.request.urlopen(base_url + query).read() # python 3.x
# response = urllib.urlopen(base_url + query).read() # python 2.x
tree = etree.fromstring(response)

Now I have to do the following to find all entries in the XML:

e_ = tree.findall('{http://www.w3.org/2005/Atom}entry')

And to get the id of an entry, I have to do the following:

print(e_[0].find('{http://www.w3.org/2005/Atom}id').text)  # e_ is a list, so index into it first

Is there a way to parse this XML without having to write {http://www.w3.org/2005/Atom} every time I look up an element, i.e. something like tree.findall('entry')? Alternatively, does lxml offer functionality that works similarly to feedparser?

score 1 · Answer 1 · answered Sep 28 '16 at 01:06

You can use the following XPath expression to match an element by its local name, ignoring the namespace:

e_ = tree.xpath('*[local-name()="entry"]')
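
If the goal is just to avoid repeating the full namespace URI, two other lxml options may be worth trying: a prefix map passed through the namespaces argument of find/findall, or the {*} wildcard that lxml's ElementPath accepts. The snippet below is a minimal sketch comparing all three approaches on a small hand-written Atom document. The SAMPLE feed, its example.org ids, and the variable names are made up for illustration and are not real arXiv output.

```python
from lxml import etree

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical stand-in for an arXiv API response (hand-written, not real data).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>sample feed</title>
  <entry>
    <id>http://example.org/abs/0001</id>
    <title>First sample entry</title>
  </entry>
  <entry>
    <id>http://example.org/abs/0002</id>
    <title>Second sample entry</title>
  </entry>
</feed>"""

tree = etree.fromstring(SAMPLE)

# Option 1: XPath with local-name(), which ignores the namespace entirely.
ids_xpath = [
    entry.xpath('*[local-name()="id"]')[0].text
    for entry in tree.xpath('*[local-name()="entry"]')
]

# Option 2: map a short prefix to the Atom namespace once and reuse it.
ns = {"atom": ATOM_NS}
ids_prefix = [
    entry.find("atom:id", namespaces=ns).text
    for entry in tree.findall("atom:entry", namespaces=ns)
]

# Option 3: the {*} wildcard in find/findall matches the tag in any namespace.
ids_wild = [entry.find("{*}id").text for entry in tree.findall("{*}entry")]

print(ids_xpath)
print(ids_prefix)
print(ids_wild)
```

All three lists should come out identical for this sample. The prefix map is the strictest of the three, since it only matches elements that really are in the Atom namespace, whereas local-name() and {*} match an entry element regardless of its namespace.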

har07


Thanks har07! I didn't know this trick before. This works perfectly. – titipata Sep 28 '16 at 01:48
