7

I working on xml sax parser to parse xml files and below is my code

xml file code:

<job>
    <title>Registered Nurse-Epilepsy</title>
    <job-code>881723</job-code>
    <detail-url>http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
    </detail-url>
    <job-category>Neuroscience Nursing</job-category>
    <description>
        <summary>
            <div class='descriptionheader'>Description</div><P STYLE="margin-top:0px;margin-bottom:0px"><SPAN STYLE="font-family:Arial;font-size:small">Utilizing the standards set forth for Nursing Practice by the ANA and ONS, the RN will organize, modify, evaluate, document and maintain the plan of care for Epilepsy and/or Neurological patients. It will include individualized, family centered, holistic, supportive, and safe age-specific care.</SPAN></P><div class='qualificationsheader'>Qualifications</div><UL STYLE="list-style-type:disc"> <LI>Graduate of an accredited school of Professional Nursing.</LI> <LI>BSN preferred </LI> <LI>Current licensure with the Board of Nurse Examiners for the State of Texas</LI> <LI>Experience in Epilepsy Monitoring and/or Neurological background preferred.</LI> <LI>ACLS preferred, within 6 months of hire</LI> <LI>PALS required upon hire</LI> </UL>
       </summary>
    </description>
    <posted-date>2012-07-26</posted-date>
    <location>
       <address>7777 Forest Lane</address>
       <city>Dallas</city>
       <state>TX</state>
       <zip>75230</zip>
       <country>US</country>
    </location>
    <company>
       <name>Medical City (Dallas, TX)</name>
      <url>http://www.hcanorthtexas.com/careers/search-jobs.dot</url>
    </company>
</job> 

Python code: (partial code to clear my doubt until start element function)

from xml.sax.handler import ContentHandler
import xml.sax
import xml.parsers.expat
import ConfigParser

class Exact(xml.sax.handler.ContentHandler):
  def __init__(self):
    self.curpath = []

  def startElement(self, name, attrs):
    print name,attrs
    self.clearFields()


  def endElement(self, name):
    pass

  def characters(self, data):
    self.buffer += data

  def clearFields():
    self.fields = {}
    self.fields['title'] = None
    self.fields['job-code'] = None
    self.fields['detail-url'] = None
    self.fields['job-category'] = None
    self.fields['description'] = None
    self.fields['summary'] = None
    self.fields['posted-date'] = None
    self.fields['location'] = None
    self.fields['address'] = None
    self.fields['city'] = None
    self.fields['state'] = None
    self.fields['zip'] = None
    self.fields['country'] = None
    self.fields['company'] = None
    self.fields['name'] = None
    self.fields['url'] = None
    
    self.buffer = ''
      
if __name__ == '__main__':
  parser = xml.sax.make_parser()
  handler = Exact()
  parser.setContentHandler(handler)
  parser.parse(open('/path/to/xml_file.xml'))

result: The result to the above print statement is given below

job     <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
title   <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-code <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-category <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
description  <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
summary       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
posted-date   <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
location      <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
address       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
city          <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
state         <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
zip           <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
country       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
company       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
name          <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
url           <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>

As you can observe above i am getting name and attrs from the print statement, but now all my intention is to get value of that name, how to fetch the values for all those tags above because i am getting only node names but not values.

Edited Code:

i really confused on how to map the data from the nodes to the keys in the dictionary as stated above

Hans Ginzel
  • 8,192
  • 3
  • 24
  • 22
Shiva Krishna Bavandla
  • 25,548
  • 75
  • 193
  • 313

3 Answers3

9

To get the content of an element, you need to overwrite the characters method... add this to your handler class:

def characters(self, data):
    print data

Be careful with this, though: The parser is not required to give you all data in a single chunk. You should use an internal Buffer and read it when needed. In most of my xml/sax code I do something like this:

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []

    def _flushCharBuffer(self):
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        self._charBuffer.append(data)

... and then call the flush method on the end of elements where I need the data.

For your whole use case - assuming you have a file containing multiple job descriptions and want a list which holds the jobs with each job being a dictionary of the fields, do something like this:

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []
        self._result = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()
        self._charBuffer = []
        return data.strip() #remove strip() if whitespace is important

    def parse(self, f):
        xml.sax.parse(f, self)
        return self._result

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        if name == 'job': self._result.append({})

    def endElement(self, name):
        if not name == 'job': self._result[-1][name] = self._getCharacterData()

jobs = MyHandler().parse("job-file.xml") #a list of all jobs

If you just need to parse a single job at a time, you can simplify the list part and throw away the startElement method - just set _result to a dict and assign to it directly in endElement.

l4mpi
  • 5,103
  • 3
  • 34
  • 54
3

To get the text content of a node, you need to implement a characters method. E.g.

class Exact(xml.sax.handler.ContentHandler):
  def __init__(self):
    self.curpath = []

  def startElement(self, name, attrs):
    print name,attrs


  def endElement(self, name):
    print 'end ' + name

  def characters(self, content):
    print content

Would output:

job <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9baec>



title <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb0c>
Registered Nurse-Epilepsy
end title



job-code <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
881723
end job-code



detail-url <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance



end detail-url

(sniped)

Gary van der Merwe
  • 9,134
  • 3
  • 49
  • 80
2

You need to implement a characters handler too:

def characters(self, content):
    print content

but this potentially gives you text in chunks instead of as one block per tag.

Do yourself a big favour though and use the ElementTree API instead; that API is far pythononic and easier to use than the XML DOM API.

from xml.etree import ElementTree as ET

etree = ET.parse('/path/to/xml_file.xml')
jobtitle = etree.find('job/title').text

If all you want is a straight conversion to a dictionary, take a look at this handy ActiveState Python Cookbook recipe: Converting XML to dictionary and back. Note that it uses the ElementTree API as well.

If you have a set of existing elements you want to look for, just use these in the find() method:

fieldnames = [
    'title', 'job-code', 'detail-url', 'job-category', 'description',
    'summary', 'posted-date', 'location', 'address', 'city', 'state',
    'zip', 'country', 'company', 'name', 'url']
fields = {}

etree = ET.parse('/path/to/xml_file.xml')

for field in fieldnames:
    elem = etree.find(field)
    if field is not None and field.text is not None:
        fields[field] = elem.text
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • Martijn Pieters : I edited my code above please have a look at it once, after fetching how to map those values by creating a dictionary with keys as node names and values as node values – Shiva Krishna Bavandla Sep 04 '12 at 13:33
  • here what is the data actually whether it is content in character function ? . Actually the fields in the dictionary are hard cored fields, because i will run this code for multiple xml urls actually, if the field in the dictionary matches the xml node then the result of the node must be mapped to that dictionary field. this is the actual concept finally – Shiva Krishna Bavandla Sep 04 '12 at 13:44
  • Not sure what you are asking there, but the code above is easy enough to generalize. Note that SO is not a code writing service. :-) – Martijn Pieters Sep 04 '12 at 13:46
  • hey i am sorry and thats true Martijn , i can able to understand, also u helped a lot, actually what i am trying is declaring some fields in the dictionary, after fetching data using xml sax , i need to map those fetched data to the fields in the dictionary if xml node matched the field in dictionary – Shiva Krishna Bavandla Sep 04 '12 at 13:51
  • Yup, I understood, and I showed you a much easier way to do this with ElementTree instead. Sorry, I rarely if ever use the SAX interfaces. – Martijn Pieters Sep 04 '12 at 13:51
  • k i will use stree then,for example i need to parse the following url http://northshorelij.jobs/feed/xml which has huge amount of data and need to map the nodes with thier data – Shiva Krishna Bavandla Sep 04 '12 at 13:57
  • is etree helps to parse these kind of urls – Shiva Krishna Bavandla Sep 04 '12 at 13:58
  • @Kouripm: why don't you try it? etree is a pythonic API to parse XML, so it'll parse that URL. It certainly will be easier than using the SAX API. – Martijn Pieters Sep 04 '12 at 14:00
  • Do yourself a big favour and learn how to use a SAX parser. Element Tree has its own uses, but this one is clearly a case for SAX. – l4mpi Sep 04 '12 at 14:37
  • @l4mpi: Are the files that big? There are incremental ElementTree parsers available too, I'd rather use those still. – Martijn Pieters Sep 04 '12 at 14:38
  • You have a huge list of field names that gets searched for - this is completely unneccessary with a sax parser. I've got a boilerplate SAX parser class which solves his problems in ~3 lines, without knowing the field names. Also, what do you do if he has multiple `` elements in his file? – l4mpi Sep 04 '12 at 14:42
  • @l4mpi: The 'huge' requirement wasn't given in the OP initial question, and I was trying to teach a man to fish, not just drown in an API that he doesn't understand. Sure, SAX has it's place, but the initial (small) problem wasn't it. – Martijn Pieters Sep 04 '12 at 14:44
  • @l4mpi: Also, meta note: the large list came from the OPs post as well; he wanted just a way to put things into a dict quickly. That's not a solution for the large file case, of course. – Martijn Pieters Sep 04 '12 at 14:46
  • this is my point, for just putting things into a dict quickly SAX is the way to go. Well maybe I'm biased, I'd use SAX for everything but cases where DOM manipulation is required. – l4mpi Sep 04 '12 at 14:48
  • @l4mpi: Right. I've *written* DOM parsers, and I definitely prefer easier-to-use APIs when I can get away with them. :-) As this particular (bad) question progressed, it only gradually became clear SAX might be a better fit, but I might instead try and use [lxml iterparse](http://lxml.de/parsing.html#iterparse-and-iterwalk) instead. – Martijn Pieters Sep 04 '12 at 14:53
  • I would have recommended him ElementTree too, but the question is about SAX parsing. oh, 8 years ago.. – Alex M.M. Nov 02 '20 at 19:56