I'm new to Python and have been taking on various projects to get up to speed. At the moment, I'm working on a routine that reads through the Code of Federal Regulations (CFR) and, for each paragraph, prints that paragraph's organizational hierarchy. For example, a simplified version of the CFR's XML structure looks like:
<CHAPTER>
<HD SOURCE="HED">PART 229—NONDISCRIMINATION ON THE BASIS OF SEX IN EDUCATION PROGRAMS OR ACTIVITIES RECEIVING FEDERAL FINANCIAL ASSISTANCE</HD>
<SECTION>
<SECTNO>### 229.120</SECTNO>
<SUBJECT>Transfers of property.</SUBJECT>
<P>If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).</P>
</SECTION>
I'd like to be able to write this out as a CSV row so that I can run text analysis:
Title 22, Volume 2, Part 229, Section 229.120, If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).
Note that I'm not taking the Title and Volume numbers from the XML, because they are actually included in the file name in a much more standardized format.
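For the CSV step, a minimal sketch might look like the following. It assumes Python 3 and the standard csv module; the function name write_rows, the column names, and the sample text are all made up for illustration. The point is that the paragraph text itself contains commas, so letting csv.writer handle the quoting is safer than joining fields by hand.

```python
import csv
import io


def write_rows(rows, out):
    """Write (title, volume, part, section, text) tuples as CSV to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(['title', 'volume', 'part', 'section', 'text'])
    writer.writerows(rows)


# Placeholder data: the paragraph text is invented and contains commas on purpose.
buffer = io.StringIO()
write_rows(
    [('22', '2', '229', '229.120', 'Sells, or otherwise transfers, property.')],
    buffer,
)
print(buffer.getvalue())
```

With a real file, the same function could be given a handle opened with open(path, 'w', newline='') instead of the in-memory buffer.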
Because I'm such a Python newbie, the code is mostly based on the search-engine code from Udacity's computer science course. Here's the Python I've written/adapted so far:
import os
import urllib2
from xml.dom.minidom import parseString

file_path = '/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml'
file_name = os.path.basename(file_path)  # Gets the filename from the path.
doc = open(file_path)
page = doc.read()

def clean_title(file_name):  # Gets the title number from the filename.
    start_title = file_name.find('title')
    end_title = file_name.find("-", start_title+1)
    title = file_name[start_title+5:end_title]
    return title

def clean_volume(file_name):  # Gets the volume number from the filename.
    start_volume = file_name.find('vol')
    end_volume = file_name.find('.xml', start_volume)
    volume = file_name[start_volume+3:end_volume]
    return volume

def get_next_section(page):  # Gets all of the text between <SECTION> tags.
    start_section = page.find('<SECTION')
    if start_section == -1:
        return None, 0
    start_text = page.find('>', start_section)
    end_quote = page.find('</SECTION>', start_text + 1)
    section = page[start_text + 1:end_quote]
    return section, end_quote

def get_section_number(section):  # Within the <SECTION> tag, find the section number based on the <SECTNO> tag.
    start_section_number = section.find('<SECTNO>###')
    if start_section_number == -1:
        return None, 0
    end_section_number = section.find('</SECTNO>', start_section_number)
    section_number = section[start_section_number+11:end_section_number]
    return section_number, end_section_number

def get_paragraph(section):  # Within the <SECTION> tag, finds <P> paragraphs.
    start_paragraph = section.find('<P>')
    if start_paragraph == -1:
        return None, 0
    end_paragraph = section.find('</P>', start_paragraph)
    paragraph = section[start_paragraph+3:end_paragraph]
    return start_paragraph, paragraph, end_paragraph

def print_all_paragraphs(page):  # This is the section that I would *like* to have print each paragraph and the citation hierarchy.
    section, endpos = get_next_section(page)
    for pragraph in section:
        title = clean_title(file_name)
        volume = clean_volume(file_name)
        section, endpos = get_next_section(page)
        section_number, end_section_number = get_section_number(section)
        start_paragraph, paragraph, end_paragraph = get_paragraph(section)
        if paragraph:
            print "Title: "+ title + " Volume: "+ volume +" Section Number: "+ section_number + " Text: "+ paragraph
            page = page[end_paragraph:]
        else:
            break

print print_all_paragraphs(page)
doc.close()
At the moment, this code has the following issues (example output to follow):
- It prints the first paragraph multiple times. How can I print each <P> tag with its own title number, volume number, etc.?
- The CFR has empty sections that are "Reserved". These sections don't have <P> tags, so the if test fails and the loop stops. I've tried implementing for/while loops, but for some reason when I do this the code then just prints the first paragraph it finds repeatedly.
Here's an example of the output:
Title: 22 Volume: 1 Section Number: 9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number: 9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number: 9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number: 9.11 Text: The Information and Privacy Coordinator shall be responsible for conducting a program for systematic declassification review of historically valuable records that were exempted from the automatic declassification provisions of section 3.3 of the Executive Order. The Information and Privacy Coordinator shall prioritize such review on the basis of researcher interest and the likelihood of declassification upon review.
Title: 22 Volume: 1 Section Number: 9.12 Text: For Department procedures regarding the access to classified information by historical researchers and certain former government personnel, see Sec. 171.24 of this Title.
Title: 22 Volume: 1 Section Number: 9.13 Text: Specific controls on the use, processing, storage, reproduction, and transmittal of classified information within the Department to provide protection for such information and to prevent access by unauthorized persons are contained in Volume 12 of the Department's Foreign Affairs Manual.
Title: 22 Volume: 1 Section Number: 9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
Title: 22 Volume: 1 Section Number: 9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
None
Ideally, each of the entries after the citation information would be different.
What kind of loop should I run to print this properly? Is there a more "pythonic" way of doing this kind of text extraction?
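One loop shape that might address both issues, sketched under assumptions: keep a running position and pass it as the start argument to str.find, so each search resumes after the previous match instead of starting from the top again. The names iter_sections and iter_paragraphs are hypothetical, the sample string (including its RESERVED tag) is invented, and the code is written for Python 3 rather than the Python 2 used above.

```python
def iter_sections(page):
    """Yield the inner text of every <SECTION>...</SECTION> block, in order."""
    pos = 0
    while True:
        start = page.find('<SECTION', pos)
        if start == -1:
            return
        tag_end = page.find('>', start)
        end = page.find('</SECTION>', tag_end)
        if tag_end == -1 or end == -1:
            return  # Malformed or truncated input: stop instead of looping forever.
        yield page[tag_end + 1:end]
        pos = end + len('</SECTION>')


def iter_paragraphs(section):
    """Yield the text of every <P>...</P> in a section (nothing if there are none)."""
    pos = 0
    while True:
        start = section.find('<P>', pos)
        if start == -1:
            return
        end = section.find('</P>', start)
        if end == -1:
            return
        yield section[start + len('<P>'):end]
        pos = end + len('</P>')


# Invented sample; the middle section has no <P>, like a "Reserved" one.
sample = (
    '<SECTION><SECTNO>1.1</SECTNO><P>First.</P><P>Second.</P></SECTION>'
    '<SECTION><SECTNO>1.2</SECTNO><RESERVED>[Reserved]</RESERVED></SECTION>'
    '<SECTION><SECTNO>1.3</SECTNO><P>Third.</P></SECTION>'
)

for section in iter_sections(sample):
    for paragraph in iter_paragraphs(section):
        print(paragraph)
```

Because a section without <P> tags simply yields no paragraphs, the outer loop moves on to the next section instead of stopping.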
I understand that I am a complete novice, and one of the major problems I'm facing is that I simply don't have the vocabulary or topic knowledge to really find detailed answers about parsing XML with this level of detail. Any recommended reading would also be welcome.
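On the "more pythonic" question, one direction that may be worth reading about is the standard library's xml.etree.ElementTree module, which parses the XML into a tree so that tags can be walked directly instead of located with string searches. The following is only a sketch under assumptions: the sample document is invented to resemble the simplified structure above, section_rows is a hypothetical name, and the code targets Python 3.

```python
import xml.etree.ElementTree as ET

# Invented sample shaped like the simplified structure above; not real CFR text.
SAMPLE = """<CHAPTER>
<SECTION>
<SECTNO>1.1</SECTNO>
<SUBJECT>First subject.</SUBJECT>
<P>First paragraph.</P>
<P>Second paragraph.</P>
</SECTION>
<SECTION>
<SECTNO>1.2</SECTNO>
<SUBJECT>[Reserved]</SUBJECT>
</SECTION>
</CHAPTER>"""


def section_rows(root):
    """Yield (section number, paragraph text) for each <P> in each <SECTION>."""
    for section in root.iter('SECTION'):
        number = (section.findtext('SECTNO') or '').strip()
        for paragraph in section.iter('P'):
            # itertext() also collects text that sits inside nested inline tags.
            yield number, ''.join(paragraph.itertext()).strip()


root = ET.fromstring(SAMPLE)
for number, text in section_rows(root):
    print(number, text)
```

For a file on disk, ET.parse(file_path).getroot() would take the place of ET.fromstring(SAMPLE). The section with no <P> children produces no rows, so reserved sections are skipped without special handling.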