
I am trying to use lxml.etree.parse and tree.xpath in Django to parse some content from an external RSS feed, but for some reason I am unable to get any results. I have used the method below successfully on other XML files, but I am having difficulties with this one.

Here is the XML I am trying to scrape:

<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Open Library : Author Name</title>
    <link href="http://www.somedomain.org/people/atom/author_name" rel="self"/>
    <updated>2012-03-20T16:41:00Z</updated>
    <author>
        <name>somedomain.org</name>
    </author>
    <id>tag:somedomain.org,2007:/person_feed/123456</id>
    <entry>
        <link href="http://www.somedomain.org/roll_call/show/1234" rel="alternate"/>
        <id>
        tag:somedomain.org,2012-03-20:/roll_call_vote/1234
        </id>
        <updated>2012-03-20T16:41:00Z</updated>
        <title>Once upon a time</title>
        <content type="html">
        This is a book full of words
        </content>
    </entry>
</feed>

Here is what my view in django looks like:

def openauthors(request):

    tree = lxml.etree.parse("http://www.somedomain.org/people/atom/author_name")
    listings = tree.xpath("//author")

    listings_info = []

    for listing in listings:
        this_value = {
            "name":listing.findtext("name"),
            }

        listings_info.append(this_value)


    json_listings = '{"listings":' + simplejson.dumps(listings_info) + '}'

    if("callback" in request.GET.keys()):
        callback = request.GET["callback"]
    else:
        callback = None

    if(callback):
        response = HttpResponse("%s(%s)" % (
                callback,
                simplejson.dumps(listings_info)
                ), mimetype="application/json"
            )
    else:
        response = HttpResponse(json_listings, mimetype="application/json")
    return response

I have also tried the following paths, without success:

    listings = tree.xpath("feed/author")
    listings = tree.xpath("/feed/author")
    listings = tree.xpath("/author")
    listings = tree.xpath("author")

Any help in the right direction would be appreciated.

bigmike7801

1 Answer


The problem may be namespaces. lxml stores each tag name together with its namespace URI, in the form `{uri}tag`, so an XPath expression that names only `author` will not match an element that lives in the Atom namespace. If you iterate over the elements, look at the tag names, and get something like this, then that is the problem:

>>> for element in tree.getroot():
...     element
[...]
<Element {http://www.w3.org/2005/Atom}author at 7f14e75d1788>
[...]

Note the prefix "{http://www.w3.org/2005/Atom}" before the tag name "author". If you see it, have a look here:

Need Help using XPath in ElementTree and here:

python: xml.etree.ElementTree, removing "namespaces"

Also check the official documentation; there may be an option for parsing without namespace prefixes.
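If namespaces are indeed the cause, one way to make the XPath match is to bind a prefix to the Atom URI and pass it through the `namespaces` argument of `xpath()`. The following is a minimal sketch, not a drop-in fix for the view: the prefix `atom`, the names `ATOM_NS` and `SAMPLE`, and the trimmed inline copy of the feed are assumptions made for illustration.

```python
from lxml import etree

# Hypothetical prefix mapping; the prefix name "atom" is arbitrary,
# only the URI has to match the xmlns declared on <feed>.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Trimmed inline copy of the feed from the question, so the sketch
# runs without fetching anything over the network.
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Open Library : Author Name</title>
    <author>
        <name>somedomain.org</name>
    </author>
    <entry>
        <title>Once upon a time</title>
    </entry>
</feed>"""

root = etree.fromstring(SAMPLE)

# Without a prefix, the expression looks for <author> in no namespace
# and therefore matches nothing in this document.
unprefixed = root.xpath("//author")

# With the prefix bound to the Atom URI, the element is found.
authors = root.xpath("//atom:author", namespaces=ATOM_NS)

# Child lookups need the same namespace treatment.
listings_info = [
    {"name": author.findtext("atom:name", namespaces=ATOM_NS)}
    for author in authors
]

print(unprefixed)
print(listings_info)
```

In the view from the question, the same mapping would have to be passed both to `tree.xpath(...)` and to `listing.findtext(...)`, since the `name` element is in the Atom namespace as well.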

Good luck.

Community
  • Good suggestion, but I wasn't able to get anything to work. I tried some different variations of some code examples but with no success. I added `namespace = "{http://www.w3.org/2005/Atom}"` `listings = tree.findall('{%s}author/' % namespace)` and removed `listings = tree.xpath("//author")`, but that seems to end in the same result :( – bigmike7801 Mar 22 '12 at 15:03
  • Namespaces did end up being the issue, but your solution wasn't quite what I was looking for. Thanks for the help though! – bigmike7801 Mar 22 '12 at 18:54
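A note on the `findall` attempt in the first comment: the `namespace` variable there already contains the braces, so `'{%s}author/' % namespace` produces a path with doubled braces and a trailing slash, which may be why it returned nothing. The sketch below shows the string that gets built, and a lookup using a single pair of braces. It uses the standard-library `xml.etree.ElementTree` so that it runs without extra packages; lxml's `findall` accepts the same `{uri}tag` notation. The names `ATOM_URI` and `SAMPLE` and the trimmed inline feed are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical constant: the bare URI, without surrounding braces.
ATOM_URI = "http://www.w3.org/2005/Atom"

# Trimmed inline copy of the feed from the question.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Open Library : Author Name</title>
    <author>
        <name>somedomain.org</name>
    </author>
</feed>"""

root = ET.fromstring(SAMPLE)

# What the code in the comment builds: the braces appear twice.
namespace = "{http://www.w3.org/2005/Atom}"
broken_path = '{%s}author/' % namespace
print(broken_path)

# A single pair of braces around the URI, and no trailing slash.
author_tag = "{%s}author" % ATOM_URI
name_tag = "{%s}name" % ATOM_URI

# <author> is a direct child of <feed>, so a plain tag lookup is enough.
authors = root.findall(author_tag)
names = [author.findtext(name_tag) for author in authors]
print(names)
```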