How to parse XML hierarchies with Python?

Question

I'm new to Python and have been taking on various projects to get up to speed. At the moment, I'm working on a routine that will read through the Code of Federal Regulations and, for each paragraph, print the organizational hierarchy for that paragraph. For example, a simplified version of the CFR's XML schema would look like:

<CHAPTER>
<HD SOURCE="HED">PART 229—NONDISCRIMINATION ON THE BASIS OF SEX IN EDUCATION PROGRAMS OR ACTIVITIES RECEIVING FEDERAL FINANCIAL ASSISTANCE</HD>
     <SECTION>
        <SECTNO>### 229.120</SECTNO>
        <SUBJECT>Transfers of property.</SUBJECT>
        <P>If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).</P>
     </SECTION>

I'd like to be able to print this to a CSV so that I can run text analysis:

Title 22, Volume 2, Part 229, Section 229.120, If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).

Note that I'm not taking the Title and Volume numbers from the XML, because they are actually included in the file name in a much more standardized format.

Because I'm such a Python newbie, the code is mostly based on the search-engine code from Udacity's computer science course. Here's the Python I've written/adapted so far:

import os
import urllib2
from xml.dom.minidom import parseString
file_path = '/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml'
file_name = os.path.basename(file_path) #Gets the filename from the path.
doc = open(file_path)
page = doc.read()

def clean_title(file_name): #Gets the title number from the filename.
    start_title = file_name.find('title')
    end_title = file_name.find("-", start_title+1)
    title = file_name[start_title+5:end_title]
    return title

def clean_volume(file_name): #Gets the volume number from the filename.
    start_volume = file_name.find('vol')
    end_volume = file_name.find('.xml', start_volume)
    volume = file_name[start_volume+3:end_volume]
    return volume

def get_next_section(page): #Gets all of the text between <SECTION> tags.
    start_section = page.find('<SECTION')
    if start_section == -1:
        return None, 0
    start_text = page.find('>', start_section)
    end_quote = page.find('</SECTION>', start_text + 1)
    section = page[start_text + 1:end_quote]
    return section, end_quote

def get_section_number(section): #Within the <SECTION> tag, find the section number based on the <SECTNO> tag.
    start_section_number = section.find('<SECTNO>###')
    if start_section_number == -1:
        return None, 0
    end_section_number = section.find('</SECTNO>', start_section_number)
    section_number = section[start_section_number+11:end_section_number]
    return section_number, end_section_number

def get_paragraph(section): #Within the <SECTION> tag, finds <P> paragraphs.
    start_paragraph = section.find('<P>')
    if start_paragraph == -1:
        return None, 0
    end_paragraph = section.find('</P>', start_paragraph)
    paragraph = section[start_paragraph+3:end_paragraph]
    return start_paragraph, paragraph, end_paragraph


def print_all_paragraphs(page): #This is the section that I would *like* to have print each paragraph and the citation hierarchy.
    section, endpos = get_next_section(page)
    for pragraph in section:
        title = clean_title(file_name)
        volume = clean_volume(file_name)
        section, endpos = get_next_section(page)
        section_number, end_section_number = get_section_number(section)
        start_paragraph, paragraph, end_paragraph = get_paragraph(section)
        if paragraph:
            print "Title: "+ title + " Volume: "+ volume +" Section Number: "+ section_number + " Text: "+ paragraph
            page = page[end_paragraph:]
        else:
            break

print print_all_paragraphs(page)
doc.close()

At the moment, this code has the following issues (example output to follow):

1. It prints the first paragraph multiple times. How can I print each <P> tag with its own title number, volume number, etc.?
2. The CFR has empty sections that are "Reserved". These sections don't have <P> tags, so the if/else check breaks out of the loop. I've tried implementing for/while loops, but for some reason when I do this the code then just prints the first paragraph it finds repeatedly.

Here's an example of the output:

Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.11 Text: The Information and Privacy Coordinator shall be responsible for conducting a program for systematic declassification review of historically valuable records that were exempted from the automatic declassification provisions of section 3.3 of the Executive Order. The Information and Privacy Coordinator shall prioritize such review on the basis of researcher interest and the likelihood of declassification upon review.
Title: 22 Volume: 1 Section Number:  9.12 Text: For Department procedures regarding the access to classified information by historical researchers and certain former government personnel, see Sec. 171.24 of this Title.
Title: 22 Volume: 1 Section Number:  9.13 Text: Specific controls on the use, processing, storage, reproduction, and transmittal of classified information within the Department to provide protection for such information and to prevent access by unauthorized persons are contained in Volume 12 of the Department's Foreign Affairs Manual.
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
None

Ideally, each of the entries after the citation information would be different.

What kind of loop should I run to print this properly? Is there a more "pythonic" way of doing this kind of text extraction?

I understand that I am a complete novice, and one of the major problems I'm facing is that I simply don't have the vocabulary or topic knowledge to really find detailed answers about parsing XML with this level of detail. Any recommended reading would also be welcome.

You import an XML parser but don't seem to use it, and are trying to write your own parser instead. You really don't want to do that; using an existing parser is a much better idea. I would suggest reading this question: http://stackoverflow.com/questions/1912434/easiest-way-to-parse-xml-in-python – tacaswell, Feb 14 '13 at 03:44
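To illustrate the comment's point, here is a minimal sketch that uses xml.dom.minidom, the parser the question already imports, in place of the hand-written find() calls. It is written for Python 3, and the SAMPLE string and the helper names node_text and section_paragraphs are hypothetical, modeled on the simplified XML in the question rather than on a real CFR file.

```python
from xml.dom.minidom import parseString

# Hypothetical sample modeled on the simplified XML in the question.
# The second section mimics a "Reserved" section with no <P> children.
SAMPLE = """<CHAPTER>
  <SECTION>
    <SECTNO>229.120</SECTNO>
    <SUBJECT>Transfers of property.</SUBJECT>
    <P>First paragraph.</P>
    <P>Second paragraph.</P>
  </SECTION>
  <SECTION>
    <SECTNO>229.125</SECTNO>
    <SUBJECT>[Reserved]</SUBJECT>
  </SECTION>
</CHAPTER>"""


def node_text(node):
    """Concatenate all text nodes beneath a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(node_text(child))
    return "".join(parts)


def section_paragraphs(xml_text):
    """Return a list of (section_number, paragraph_text) pairs."""
    dom = parseString(xml_text)
    rows = []
    for section in dom.getElementsByTagName("SECTION"):
        numbers = section.getElementsByTagName("SECTNO")
        if not numbers:
            continue  # no section number to cite, so skip this section
        number = node_text(numbers[0]).strip()
        # A section without <P> elements simply contributes no rows.
        for paragraph in section.getElementsByTagName("P"):
            rows.append((number, node_text(paragraph).strip()))
    return rows


rows = section_paragraphs(SAMPLE)
for number, text in rows:
    print("Section %s: %s" % (number, text))
```

Because the parser hands back one node per <P> element, each paragraph is visited exactly once, and a section with no paragraphs does not stop the loop.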

score 0 · Accepted Answer · answered Feb 14 '13 at 04:29

I like to solve problems like this with XPath or XSLT. You can find a great implementation in lxml (not in the standard library, so it needs to be installed). For instance, with the simplified sample above, the XPath //CHAPTER/SECTION[SECTNO] selects all sections that contain a SECTNO element. From each selected section you use relative XPath expressions to grab the values you want. Multiple nested for loops disappear. XPath has a bit of a learning curve, but there are many examples out there.
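As a rough sketch of this approach, the example below selects sections with a path expression and writes one CSV row per paragraph. Note the substitution: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, rather than lxml, so that it runs without installing anything. It is written for Python 3. The path .//SECTION[SECTNO] is used because SECTION is a sibling of HD, not a child of it, in the question's sample. The SAMPLE data and the helper names title_and_volume and paragraph_rows are hypothetical, and the file name pattern is assumed from the path shown in the question.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the simplified XML in the question.
SAMPLE = """<CHAPTER>
  <HD SOURCE="HED">PART 229</HD>
  <SECTION>
    <SECTNO>229.120</SECTNO>
    <SUBJECT>Transfers of property.</SUBJECT>
    <P>First paragraph.</P>
    <P>Second paragraph.</P>
  </SECTION>
  <SECTION>
    <SUBJECT>[Reserved]</SUBJECT>
  </SECTION>
</CHAPTER>"""

# File name taken from the path in the question.
FILE_NAME = "CFR-2012-title22-vol1.xml"


def title_and_volume(file_name):
    """Pull the title and volume numbers out of the file name."""
    match = re.search(r"title(\d+)-vol(\d+)", file_name)
    if match is None:
        raise ValueError("unexpected file name: %r" % file_name)
    return match.group(1), match.group(2)


def paragraph_rows(xml_text, file_name):
    """Return one [title, volume, section, text] row per <P> element."""
    title, volume = title_and_volume(file_name)
    root = ET.fromstring(xml_text)
    rows = []
    # Matches SECTION elements at any depth that have a SECTNO child.
    for section in root.findall(".//SECTION[SECTNO]"):
        number = section.findtext("SECTNO").strip()
        for paragraph in section.findall("P"):
            # itertext() also collects text inside nested inline tags.
            text = "".join(paragraph.itertext()).strip()
            rows.append([title, volume, number, text])
    return rows


rows = paragraph_rows(SAMPLE, FILE_NAME)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

With lxml the structure would be much the same, with its xpath() method accepting full XPath 1.0 expressions in place of findall().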

tdelaney

I second the motion for ``lxml``. It's WAY better than trying to do your own text processing and I've found it far better than the lower-level built-in libraries for the many times I've worked with XML over the years. – scanny Feb 14 '13 at 07:06
