How do I parse XML elements and get values with Python 2.7?

Question

API response: http://iss.ndl.go.jp/api/opensearch?isbn=9784334770051 Hello, and thanks for the help yesterday. However, when I try to get values from the elements, I always get an empty result. I was referred to this link, but I'm not sure I understand it. What did I do wrong, and why am I getting empty values?

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import codecs
    import sys
    import urllib
    import urllib2
    import re, pprint
    from xml.etree.ElementTree import *
    import csv
    from xml.dom import minidom
    import xml.etree.ElementTree as ET
    import shelve
    import subprocess

    errorCheck = "0"
    isbn = raw_input("Enter ISBN Number Please ")
    isIsbn = len(isbn)

    # ElementTree requires namespace definition to work with XML with namespaces correctly
    # It is hardcoded at this point, but this should be constructed from response.
    namespaces = {
      'dc': 'http://purl.org/dc/elements/1.1/',
      'dcndl': 'http://ndl.go.jp/dcndl/terms/',
    }

    # for prefix, uri in namespaces.iteritems():
        # ElementTree.register_namespace(prefix, uri)

    if isIsbn == 10 or isIsbn == 13:
        errorCheck = 1
        url = "http://iss.ndl.go.jp/api/opensearch?isbn=%s" % isbn
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        tree = ET.parse(response)
        root = tree.getroot()
        # root = ET.fromstring(XmlData) 
        print root.findall('dc:title', namespaces)
        print root.findall('dc:title')
        print root.findall('dc:identifier', namespaces)
        print root.findall('dc:identifier')
        print root.findall('identifier')

    if errorCheck == "0":
        print "It is not ISBN"

        # print(root.tag,root.attrib)    

        # for child in root.find('.//item'):
        # print child.text

Padraic Cunningham · Accepted Answer · 2016-09-07T01:38:57.703

Your code needs a slight modification: add .// to the expression in each findall call. The root node is the rss node, and the dc:title elements are descendants of the rss node, not direct children, so you need to search through the whole document:

    import xml.etree.ElementTree as ET
    import requests

    url = "http://iss.ndl.go.jp/api/opensearch?isbn=9784334770051"
    # fromstring returns the root element (rss) of the parsed response body.
    tree = ET.fromstring(requests.get(url).content)
    namespaces = {
        'dc': 'http://purl.org/dc/elements/1.1/',
        'dcndl': 'http://ndl.go.jp/dcndl/terms/',
    }
    # .// searches all descendants of the root, not just its direct children.
    print([t.text for t in tree.findall('.//dc:title', namespaces)])
    print([i.text for i in tree.findall('.//dc:identifier', namespaces)])
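
To see the difference between a direct-child search and a descendant search without hitting the network, here is a minimal sketch. The SAMPLE document and its values are hypothetical, a trimmed guess at the shape of the response (rss > channel > item), not real API output:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for the API response (assumed shape).
SAMPLE = """<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <item>
      <dc:title>Sample title</dc:title>
      <dc:identifier>0000000000</dc:identifier>
    </item>
  </channel>
</rss>"""

namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'}
root = ET.fromstring(SAMPLE)

# 'dc:title' alone only matches direct children of <rss>, so nothing is found.
direct = root.findall('dc:title', namespaces)

# './/dc:title' searches every descendant of the root.
nested = root.findall('.//dc:title', namespaces)

print(len(direct))
print([t.text for t in nested])
```

The first search comes back empty for the same reason the code in the question does: the title sits two levels below the root.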

You can also do it very easily with lxml, which exposes the document's prefix-to-namespace mapping for you (nsmap) and can fetch the source from the URL directly:

    In [1]: import lxml.etree as et

    In [2]: url = "http://iss.ndl.go.jp/api/opensearch?isbn=9784334770051"

    In [3]: tree = et.parse(url)

    In [4]: nsmap = tree.getroot().nsmap

    In [5]: print(tree.xpath("//dc:title/text()", namespaces=nsmap))
    [u'\u9244\u8155\u30a2\u30c8\u30e0']

    In [6]: print(tree.xpath("//dc:identifier/text()", namespaces=nsmap))
    ['4334770053', '95078560']
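
If you would rather stay with the standard library, something similar to nsmap can be approximated by collecting the xmlns declarations while parsing. This is only a sketch: collect_namespaces is a hypothetical helper name, and SAMPLE is a made-up stand-in for the response body, not real API output:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the response body (assumed shape).
SAMPLE = b"""<rss xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:dcndl="http://ndl.go.jp/dcndl/terms/" version="2.0">
  <channel><item><dc:title>Sample title</dc:title></item></channel>
</rss>"""


def collect_namespaces(source):
    """Return a {prefix: uri} dict built from the xmlns declarations in source."""
    found = {}
    # 'start-ns' events carry a (prefix, uri) pair for each declaration.
    for _event, (prefix, uri) in ET.iterparse(source, events=('start-ns',)):
        found[prefix] = uri
    return found


namespaces = collect_namespaces(io.BytesIO(SAMPLE))
root = ET.fromstring(SAMPLE)
titles = [t.text for t in root.findall('.//dc:title', namespaces)]

print(sorted(namespaces))
print(titles)
```

This replaces the hardcoded namespaces dict from the question with one read from the document itself, at the cost of parsing the body twice in this simple form.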

With the ElementTree tree from the first snippet, you can see the path to one of the dc:title elements:

    In [55]: tree
    Out[55]: <Element 'rss' at 0x7f996e8b66d0> # root

    In [56]: tree.findall('channel') # child of root so don't need .//
    Out[56]: [<Element 'channel' at 0x7f996e131990>]

    In [57]: tree.findall('channel/item/dc:title', namespaces) # item is a descendant of rss, item is parent of the dc:title
    Out[57]: [<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7f996e131910>]

The same works for the identifiers:

    In [58]: tree.findall('channel//item//dc:identifier', namespaces)
    Out[58]: 
    [<Element '{http://purl.org/dc/elements/1.1/}identifier' at 0x7f996e131c50>,
     <Element '{http://purl.org/dc/elements/1.1/}identifier' at 0x7f996e131250>]
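
The {uri}name tags in that output are how ElementTree stores namespaced names internally, and the same form can be used in a search path instead of a prefix. A small sketch, again on a hypothetical sample document with made-up identifier values:

```python
import xml.etree.ElementTree as ET

DC = 'http://purl.org/dc/elements/1.1/'

# Hypothetical trimmed sample (assumed shape, made-up values).
SAMPLE = """<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <item>
      <dc:identifier>0000000000</dc:identifier>
      <dc:identifier>11111111</dc:identifier>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE)

# Prefix form: the prefix is resolved through the namespaces mapping.
by_prefix = root.findall('channel/item/dc:identifier', {'dc': DC})

# Expanded form: the URI is written in braces, so no mapping is needed.
by_uri = root.findall('channel/item/{%s}identifier' % DC)

print(by_prefix[0].tag)
print([e.text for e in by_uri])
```

Both searches return the same elements; the prefix form is just shorthand that the namespaces dict expands into the braced form.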

Thanks, this really helped me. – Sakai Kyoutarou, Sep 07 '16 at 04:27
