I'm fetching XML from arXiv and want to parse all of the entries in it using the lxml
library. Here is my code to fetch an XML file of 100 articles.
import urllib.request  # python 3.x: a bare "import urllib" does not load the request submodule
from lxml import etree
start_index = 0
results_per_iteration = 100
base_url = 'http://export.arxiv.org/api/query?'
search_query = 'cat:cs.CV+OR+cat:cs.LG+OR+cat:cs.CL+OR+cat:cs.NE+OR+cat:stat.ML'
# max_results is the number of results to return, not an end index
query = 'search_query=%s&sortBy=lastUpdatedDate&start=%i&max_results=%i'\
    % (search_query, start_index, results_per_iteration)
response = urllib.request.urlopen(base_url + query).read() # python 3.x
# response = urllib.urlopen(base_url + query).read() # python 2.x
tree = etree.fromstring(response)
Now I have to do the following to find all entries in the XML:
e_ = tree.findall('{http://www.w3.org/2005/Atom}entry')
And to get the id of an entry, I have to do the following:
print(e_[0].find('{http://www.w3.org/2005/Atom}id').text)  # e_ is a list, so index into it first
My question: is there a way to parse this XML without having to put {http://www.w3.org/2005/Atom} in front of every element name, i.e. so that tree.findall('entry') works? Or is there an lxml script or feature that works similarly to feedparser?
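To make the question concrete, here is a minimal sketch of the two behaviours I have in mind. It is not a definitive solution. It runs on a small hand-written sample instead of the real arXiv response. SAMPLE, NS, entry_ids and strip_namespaces are hypothetical names made up for this illustration, and the prefix 'atom' is an arbitrary choice. The first helper passes a prefix map to find/findall so the URI is written only once. The second rewrites the tags in place so that a plain findall('entry') works.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:  # assumption: fall back to the stdlib so the sketch still runs
    import xml.etree.ElementTree as etree

# Hypothetical, hand-written stand-in for the arXiv Atom response.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>http://example.org/abs/0001</id><title>First</title></entry>
  <entry><id>http://example.org/abs/0002</id><title>Second</title></entry>
</feed>"""

# Prefix map: 'atom' is an arbitrary prefix chosen for this sketch.
NS = {'atom': 'http://www.w3.org/2005/Atom'}


def entry_ids(xml_bytes):
    """Return the text of every entry's <id>, using a prefix map."""
    tree = etree.fromstring(xml_bytes)
    return [entry.find('atom:id', namespaces=NS).text
            for entry in tree.findall('atom:entry', namespaces=NS)]


def strip_namespaces(root):
    """Remove the '{uri}' part from every element tag, in place."""
    for el in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(el.tag, str) and '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
    return root


ids = entry_ids(SAMPLE)

plain = strip_namespaces(etree.fromstring(SAMPLE))
titles = [entry.find('title').text for entry in plain.findall('entry')]
```

The prefix map keeps the document unchanged and only shortens the lookups. Stripping the tags gives the feedparser-like tree.findall('entry') form, but it discards the namespace information, which would matter if the feed mixed elements from several namespaces.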