Extracting data from a text file and writing it to csv or flat file

Question

I'm doing a project that involves creating an RDBMS of the US federal code in a certain format. I've obtained the whole code from the official source, which is not well structured. I have managed to scrape the US Code into text files, in the format shown below, using some code from GitHub.

Can a Python script parse these text files and write the result to a CSV or flat file in the format below?

I'm new to Python but I'm told that this can easily be done using Python.

End output would be a flat file or a csv file with the below schema:

Example:

**Title | Text | Chapter | text | Section | Text | Section text**


1     |  GENERAL PROVISIONS  |  1 | RULES OF CONSTRUCTION | 2 | "County" as including "parish", and so forth | The word "county" includes a parish, or any other equivalent subdivision of a State or Territory of the United States.

Input would be a text file with data that looks like below.

Sample data:

-CITE-
    1 USC Sec. 2                                                01/15/2013

-EXPCITE-
    TITLE 1 - GENERAL PROVISIONS
    CHAPTER 1 - RULES OF CONSTRUCTION

-HEAD-
    Sec. 2. "County" as including "parish", and so forth

-STATUTE-
      The word "county" includes a parish, or any other equivalent
    subdivision of a State or Territory of the United States.

-SOURCE-
    (July 30, 1947, ch. 388, 61 Stat. 633.)

-End-



-CITE-
    1 USC Sec. 3                                                01/15/2013

-EXPCITE-
    TITLE 1 - GENERAL PROVISIONS
    CHAPTER 1 - RULES OF CONSTRUCTION

-HEAD-
    Sec. 3. "Vessel" as including all means of water transportation

-STATUTE-
      The word "vessel" includes every description of watercraft or
    other artificial contrivance used, or capable of being used, as a
    means of transportation on water.

-SOURCE-
    (July 30, 1947, ch. 388, 61 Stat. 633.)

-End-

senshin · Accepted Answer · 2014-01-21T15:43:03.263

2

If you wanted to use a robust parser like pyparsing rather than regexes, the following should work for you:

import csv, re
from pyparsing import Empty, FollowedBy, Group, LineEnd, Literal, \
                      OneOrMore, Optional, Regex, SkipTo, Word
from pyparsing import alphanums, alphas, nums

def section(header, other):
    return Literal('-'+header+'-').suppress() + other

def tc(header, next_item):
    # <header> <number> - <name>
    begin = Literal(header).suppress()
    number = Word(nums)\
             .setResultsName('number')\
             .setParseAction(compress_whitespace)
    dash = Literal('-').suppress()
    name = SkipTo(Literal(next_item))\
           .setResultsName('name')\
           .setParseAction(compress_whitespace)
    return begin + number + dash + name

def compress_whitespace(s, loc, toks):
    return [re.sub(r'\s+', ' ', tok).strip() for tok in toks]

def parse(data):
    # should match anything that looks like a header
    header = Regex(re.compile(r'-[A-Z0-9]+-'))

    # -CITE- (ignore)
    citation = SkipTo('-EXPCITE-').suppress()
    cite_section = section('CITE', citation)

    # -EXPCITE- (parse)
    # grab title number, title name, chapter number, chapter name
    title = Group(tc('TITLE', 'CHAPTER'))\
            .setResultsName('title')
    chapter = Group(tc('CHAPTER', '-HEAD-'))\
              .setResultsName('chapter')
    expcite_section = section('EXPCITE', title + chapter)

    # -HEAD- (parse)
    # two possible forms of section number:
    # > Sec. 1. <head_text>
    # > CHAPTER 1 - <head_text>
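    # NOTE: the trailing '.' in the regex is unescaped, so it matches any single
    # character (normally the period after the number); the parse action drops it.
    sec_number1 = Literal("Sec.").suppress() \
                  + Regex(r'\d+\w?.')\
                    .setResultsName('section')\
                    .setParseAction(lambda s, loc, toks: toks[0][:-1])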
    sec_number2 = Literal("CHAPTER").suppress() \
                  + Word(nums)\
                    .setResultsName('section') \
                  + Literal("-")
    sec_number = sec_number1 | sec_number2
    head_text = SkipTo(header)\
                .setResultsName('head')\
                .setParseAction(compress_whitespace)
    head = sec_number + head_text
    head_section = section('HEAD', head)

    # -STATUTE- (parse)
    statute = SkipTo(header)\
              .setResultsName('statute')\
              .setParseAction(compress_whitespace)
    statute_section = section('STATUTE', statute)

    # -End- (ignore)
    end_section = SkipTo('-End-', include=True)

    # do parsing
    parser = OneOrMore(Group(cite_section \
                             + expcite_section \
                             + head_section \
                             + Optional(statute_section) \
                             + end_section))
    result = parser.parseString(data)

    return result

def write_to_csv(parsed_data, filename):
    with open(filename, 'w') as f:
        writer = csv.writer(f, lineterminator='\n')
        for item in parsed_data:
            if 'statute' not in item:
                continue
            row = [item['title']['number'],
                   item['title']['name'],
                   item['chapter']['number'],
                   item['chapter']['name'],
                   item['section'],
                   item['head'],
                   item['statute']]
            writer.writerow(row)



# your data is assumed to be in <source.txt>
with open('source.txt', 'r') as f:
    data = f.read()
result = parse(data)
write_to_csv(result, 'output.txt')

Output: see http://pastie.org/8654063.

This is certainly more verbose than using regexes, but it's also more maintainable and extensible in my opinion. (Granted, this comes with the overhead of learning how to do basic manipulations in pyparsing, which isn't necessarily trivial.)

In response to your request: I have updated the parser to accommodate all the text that appears in the file you linked. It should now be more robust against unusual line breaks and punctuation.

As you requested, the citations that have an enumeration of sections (and lack a -STATUTE- section) are no longer included in the output.

edited Jan 21 '14 at 15:43

answered Jan 20 '14 at 17:59

senshin


After installing pyparsing, when I run the above script it gives me the errors below. Could you please help me resolve these? `Traceback (most recent call last): File "1usc1.py", line 71, in <module> result = parse(data) File "1usc1.py", line 51, in parse result = parser.parseString(data) File "/Library/Python/2.7/site-packages/pyparsing.py", line 1041, in parseString raise exc pyparsing.ParseException: Expected "-EXPCITE-" (at char 26), (line:2, col:19)` – koder Jan 21 '14 at 12:06
@neo Can you post a link to the data file you're working off of? This presumably means that one of your citations lacks an -EXPCITE- section. – senshin Jan 21 '14 at 12:09
Thanks for the quick response. I'm supposing that all the tags are not present in each of the sections. Here's the link to one of the 51 text files I'm trying to parse. http://pastebin.com/m7gZh23A – koder Jan 21 '14 at 12:13
@neo Cool. I'll work on this and ping you when I update my answer. – senshin Jan 21 '14 at 12:14
@neo I've updated my answer. Try it out and see if you encounter any issues. – senshin Jan 21 '14 at 13:01
this is excellent. The output is almost exactly the same as my expectation; except that "NO STATUTE" row for each chapter. Can we omit that from the output? I have read your update note below the code, at the beginning of each chapter, all the sections present under it are enumerated and hence that part would not have a -STATUTE- tag. I'm sorry if I have confused you with different inputs. – koder Jan 21 '14 at 14:05
@neo Ah, okay, I didn't realize that that was what was going on. This is a pretty trivial change to the code, which I'm about to update. – senshin Jan 21 '14 at 14:09
Awesome! One last query: wherever there are subsections for the code, like 106a, 106b..., the entries are given as: `1 GENERAL PROVISIONS 2 ACTS AND RESOLUTIONS; FORMALITIES OF ENACTMENT; REPEALS; SEALING OF INSTRUMENTS 10 a. Promulgation of laws`. Can we concatenate the subsections like a, b and c or whatever into the main section text instead of having separate rows for them? Or is it possible to get the whole section number like 106a, 106b, etc., so that the data would be consistent all over? – koder Jan 21 '14 at 15:26
I mean to say that instead of 106a and 106b, the script prints 10 in a column and the alphabet a or b or whatever is concatenated with the text.. which is fine if the 106 is printed at least. – koder Jan 21 '14 at 15:32
Thanks a lot for taking time on this and sorry for delayed reply; I have been trying to implement the code and parse all the 51 titles and see if all of them would parse rightly. It is taking time for QA-ing these. Some titles are quite large and have sub parts also hence this program doesn't seem to work here. I am uploading these titles to dropbox and will provide you the link for your reference and write back in a couple of hours in detail as to which ones I was not able to parse. Again, thank you very much. – koder Jan 22 '14 at 19:27
Here's the link to the folder with all the titles (titles being uploaded gradually) https://www.dropbox.com/sh/258u2aczir7nqaq/VQuVU8E4RN – koder Jan 22 '14 at 19:33
Senshin, it seems Title 42 is the largest and most complex of titles and it has the superset of all possible -TAGS- that may be used across titles like **chapter, sub chapter, part, sub part,** etc., Since some of the sections were REPEALED, OMITTED, TRANSFERRED etc which is mentioned in -HEAD- tag, the -STATUTE- tag for those sections isn't present. Is it possible to include this functionality in code and mark the REPEALED, OMITTED or TRANSFERRED etc sections as REPEALED, OMITTED or TRANSFERRED in the statute column? Title 42 uploaded in the above link would give you a good idea about this. – koder Jan 22 '14 at 20:54
I think the above is the only reason why the above code returned error or empty files while parsing some titles. Thanks a lot again for your time on this. – koder Jan 22 '14 at 20:54
@neo Sorry, I don't think I have any more time to work on this. – senshin Jan 22 '14 at 21:15
Senshin, I have tried to tweak the code to ignore the parts of files without -STATUTE- but in vain. If you don't mind could you please guide me to tweak this? – koder Jan 23 '14 at 18:47
I worked around the above issue, but a new one came up. Chapter Index could be numbers as well as alphanumerics, above code is returning error when I parse some of the titles where chapter numbers are like 9A, 9C, etc. Could you please tweak the code to fit this requirement? Thanks in advance. – koder Jan 24 '14 at 20:28

score 0 · Answer 2 · edited May 23 '17 at 11:49

1. Loop over the file's lines

with open('workfile', 'r') as f:
    for line in f:
        ...

2. Use Python's re module to match one of the tags: ['CITE', 'EXPCITE', 'HEAD'...]

3. Based on the tag matched in step 2, use re again to match the line content. Consider keeping those matchers ready in a dict:

import re

# one pre-compiled matcher per tag; `pattern` is a placeholder for your own regex
d = {'EXPCITE': re.compile(pattern)}
# and then later
m = d['EXPCITE'].match(string)
# get the relevant group, for example
print(m.group(0))

4. Write to the CSV output file

import csv

with open('out.csv', 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter='|')
    writer.writerow([...])

Also, consider implementing a state machine to switch between steps 2 and 3 above; see "Python state-machine design". Using this technique, you can switch between looking for a tag (step 2) and matching the content of that tag (step 3).

Good luck!

thanks for taking time to answer. I'm really a beginner and won't be able to implement the above instructions in code. — koder, Jan 21 '14 at 12:57
