At a standstill parsing some statutes with regex

Question

I'm parsing a huge file of statutes and I have a specific regex for non-standard statutes because they don't match the usual pattern. Here's the regex I'm using:

\n(\d*[A-Z]?-\d*[A-Z]?-\d*[\.\d]*[A-Z]?[-\d*[\.\d]*[A-Z]?]?)(?= (?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)\.\s*\n)(?:\s|\stt.*|\.)(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed).\s*\n(.*?)\n\d*[A-Z]?-\d*[A-Z]?-\d*[\.\d]*[A-Z]?[-\d*[\.\d]*[A-Z]?]?

This works great except for a few problem cases.

It doesn't work when two special cases appear right after each other; ex:

34A-1-28 Repealed. 34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-1-28 Repealed. 34A-1-28. Repealed by SL 1986, ch 295, § 7.
It doesn't work when the statute appears like so: 34A-6-88, Transferred. (comma after statute)
It doesn't work when there's a range listed: 34A-6-88 to 23-34-1A Repealed.

Any help solving these three problems would be appreciated. I've set up a regex101 which includes a chunk of the statutes I'm looking to flag here for convenience.

What makes you think this is a regular grammar that you're parsing? — Wooble, Jul 20 '14 at 19:05
I'm kind of a regular expressions newbie but I'm not really sure what either of you are asking. — allocate, Jul 20 '14 at 19:44
@JoshuaAuriemma Wooble is asking whether you are sure that what you are trying to parse is a regular language; if it is not, it cannot be parsed with regular expressions. As an example (HTML), see: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Tommy, Jul 20 '14 at 20:02
@JoshuaAuriemma I think Braj was alluding to the fact that it will be very difficult for someone to figure out your problem since we are not sure what matches are supposed to look like from your Question and the regex will be a nightmare to sift through in that format. — Tommy, Jul 20 '14 at 20:04
@Tommy, Ah, thanks for that. I guess I don't know the answer to that, Wooble. I simply couldn't think of a better option. This is all stuff I'm scraping rather crudely from the legislature's website. — allocate, Jul 20 '14 at 20:10
@JoshuaAuriemma: Your efforts are likely to turn into data-cleansing. What are you trying to achieve? — MattH, Jul 20 '14 at 20:28
What does 'parse' mean in this case? What does your desired chunk look like? If the same statute is in the file twice, do you take the later one? Your question and desired outcome are not clear. — dawg, Jul 20 '14 at 21:35
I'd like to identify certain types of statutes (?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed) with re.findall, including the accompanying statute text. See the link in the text for a half-working example. It's currently identifying most of the instances but some are missed. — allocate, Jul 20 '14 at 22:02

pdw · Answer 1 · 2014-07-20T22:15:47.520

If you need a complicated regular expression, it's important to build it step by step. That's the only way to avoid getting lost.

Two notes before we start:

I'm not familiar with legal jargon. My terminology is probably all wrong.
I'll be using the verbose flag (re.VERBOSE). With this flag, you can freely insert whitespace in your regexes to improve readability.

Let's start with the statute numbers and define a regular expression that parses a single component (such as 34A or 83.1).

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
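A quick way to check this building block on its own, before composing it into anything larger, is sketched below. The helper name is_component and the test strings are made up for illustration; only nbr comes from the answer.

```python
import re

# Same component pattern as above; re.VERBOSE makes the spaces insignificant.
nbr = r'\d+ (?: \. \d+ )? [A-Z]?'

def is_component(s):
    """Return True if the whole string is one statute-number component."""
    return re.fullmatch(nbr, s, re.VERBOSE) is not None

# Hypothetical inputs: the first three should match, the last two should not.
for candidate in ['34A', '83.1', '6', 'A34', '34-6']:
    print(candidate, is_component(candidate))
```

Testing each piece like this makes it much easier to see which layer is at fault when the combined pattern misbehaves.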

Three to five of these components, separated by dashes, make a full statute number.

statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {
    'nbr': nbr
}

With this in place, we can easily define a regex that matches both a single statute and a range. We use two groups to capture the statutes. The second one will be empty if no range is given.

statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
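As a rough sketch of how the two capture groups behave, the snippet below matches a single statute and a range. The helper split_range is a hypothetical name used only here. Note that with a match object the unused second group comes back as None, whereas re.findall reports it as an empty string.

```python
import re

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}

def split_range(s):
    """Return (first, last) for a statute or range, or None if s is neither."""
    m = re.fullmatch(statute_or_range, s, re.VERBOSE)
    return m.groups() if m else None

print(split_range('34A-6-88'))              # single statute
print(split_range('34A-6-88 to 23-34-1A'))  # range
print(split_range('34A-6'))                 # too few components
```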

And now we can construct a pattern to match the entire first line. At this point it's easy to deal with the comma that sometimes appears.

action = r'(?:Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)'

first_line = r'%(statute_or_range)s ,? \s+ %(action)s \. \s+' % {
    'statute_or_range': statute_or_range,
    'action': action
}

It's not entirely clear to me how much text you want to match. My impression is that you want to capture up to the beginning of the next statute, which is defined as a line that starts with a statute number. So:

end = r'(?= \n %(statute)s )' % {
    'statute': statute
}

Combine these and you have your regular expression:

pattern = r'%(first_line)s (.*?) %(end)s' % {
    'first_line': first_line,
    'end': end
}

import re

regex = re.compile(pattern, re.VERBOSE | re.DOTALL | re.IGNORECASE)

See it in action.
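For reference, here is the whole thing assembled into one runnable sketch. The sample text is a shortened excerpt of the statutes quoted in this thread, with one ordinary statute heading appended so that the final lookahead has something to stop at. The pieces are the same as above; only the variable name sample is new.

```python
import re

nbr = r'\d+ (?: \. \d+ )? [A-Z]?'
statute = r'%(nbr)s (?: - %(nbr)s ){2,4}' % {'nbr': nbr}
statute_or_range = r'(%(statute)s) (?: \s+ to \s+ (%(statute)s) )?' % {
    'statute': statute
}
action = (r'(?:Superseded|Repealed|Transferred|Obsolete|Reserved'
          r'|Rejected|Omitted|Not|Executed)')
first_line = r'%(statute_or_range)s ,? \s+ %(action)s \. \s+' % {
    'statute_or_range': statute_or_range,
    'action': action
}
end = r'(?= \n %(statute)s )' % {'statute': statute}
pattern = r'%(first_line)s (.*?) %(end)s' % {'first_line': first_line, 'end': end}
regex = re.compile(pattern, re.VERBOSE | re.DOTALL | re.IGNORECASE)

# Shortened sample covering the comma case, the range case, and a plain case.
sample = (
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-6-88 to 23-34-1A Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295, § 7.\n"
    "\n"
    "34A-6-89 Scale device required.\n"
)

# Each tuple is (first statute, end of range or '', accompanying text).
matches = regex.findall(sample)
for first, last, body in matches:
    print(repr(first), repr(last), repr(body.strip()))
```

One thing to keep in mind: because the pattern ends with a lookahead for the next statute heading, a special-case statute at the very end of the file would not be matched unless the lookahead is extended to accept the end of the text as well.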

dawg · Accepted Answer · 2014-07-20T22:57:21.650

The example text:

34A-6-87.1 Disposal of tire waste--Collection or processing sites--Penalties for violations.
     34A-6-87.1. Disposal of tire waste--Collection or processing sites--Penalties for violations. Any person hauling or transporting any waste tire as defined in subdivision 34A-6-61(25), originating from a wholesaler or retailer shall ensure the proper disposal of the waste tire at a department approved waste tire collection or processing site, or that it is used in some other manner approved by the department. The board may promulgate rules, pursuant to chapter 1-26, setting forth the requirements and procedures for department approval of waste tire collection, processing sites, or other approved uses for waste tires. Any waste tire hauler or transporter who intentionally disposes of any waste tire in a manner inconsistent with the provisions of this section is subject to a civil action by the State of South Dakota in circuit court for the recovery of a civil penalty of not more than ten thousand dollars per day per violation, or for costs to clean up sites not approved, or both. The violator is also subject to the criminal penalties provided for in § 34A-6-87.
Source:
  SL 1998, ch 202, § 1.          Source:

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88, Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88 to 23-34-1A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-6-89-34 Scale device required--Records--Report--Contents--Permit for longer capacity disposal.
     34A-6-89. Scale device required--Records--Report--Contents--Permit for longer capacity disposal. Any solid waste facility permitted to dispose of solid waste in excess of one hundred thousand tons per year shall be equipped with a scale device, approved by the Department of Public Safety, and shall weigh and maintain records of the total amount of solid waste disposed of at the facility. On or before the fifteenth of each month, the facility shall submit to the department a report upon such forms as may be prescribed by the department in rules promulgated pursuant to chapter 1-26. The report shall state the total amount of solid waste disposed of at the facility in the preceding month. The forms shall contain a sworn certification by the owner or operator that the information contained in the monthly report is true and correct based upon his own best information, knowledge, and belief. No facility may dispose of solid waste in excess of one hundred fifty thousand tons per year without a permit authorizing the capacity of the facility to dispose of solid waste in such quantities as provided in § 34A-6-1.16.
Source:
  SL 1992, ch 254, § 50Q; SL 2004, ch 17, § 231.          Source:

I am assuming you want to break that text into blocks separated by the statute referenced.

If so, simplify your regex. You can do:

'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))'

^ asserts position at the start of a line
1st capturing group (\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))
\d+ matches one or more digits [0-9]
\w+ matches one or more word characters [a-zA-Z0-9_]
- matches the character - literally
\d+ matches one or more digits [0-9]
- matches the character - literally
\d+ matches one or more digits [0-9]
(?:[,.\-0-9A-Z]+)? optional non-capturing group: one or more of comma, period, dash, digit, or uppercase letter
[ \t]+ matches one or more spaces or tabs
.*? matches any character, as few times as possible
(?=\n\n|\n+\Z|\Z) positive lookahead: asserts that one of the alternatives below matches at this position
1st alternative: \n\n
2nd alternative: \n+\Z
3rd alternative: \Z
g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
s modifier: single line. Dot matches newline characters

Note:

the use of the anchor ^ combined with re.S | re.M
moving the positive lookahead of (?=\n\n|\n+\Z|\Z) to the end.
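To make the effect of the anchor, the two flags, and the trailing lookahead concrete, here is a small sketch that runs the pattern over a shortened two-statute sample. The names sample and blocks are mine, and the sample is abbreviated from the example text.

```python
import re

pat = re.compile(
    r'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))',
    re.S | re.M,
)

sample = (
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295, § 7.\n"
)

# re.M lets ^ match at the start of every line; re.S lets .*? run across
# line breaks until the lookahead sees a blank line or the end of the text.
blocks = pat.findall(sample)
for block in blocks:
    print(block)
    print('-----')
```

One caveat worth checking against your real data: \d+\w+ needs at least two characters before the first dash, so a heading such as 1-26-3 would not be picked up by this pattern as written.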

Example in regex101

Once you have the individual blocks, you can further parse those blocks to find what you need. As a simple example:

import re

# txt holds the example text shown above
statutes = {}
# Split the text into blocks, one per statute heading.
pat = re.compile(r'^(\d+\w+-\d+-\d+(?:[,.\-0-9A-Z]+)?[ \t]+.*?(?=\n\n|\n+\Z|\Z))', re.S | re.M)
for block in pat.finditer(txt):
    # No re.M or re.S here, so ^.* only looks at the first line of the block.
    m = re.search(r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)', block.group(1))
    if m:
        statutes.setdefault(m.group(1), []).append(block.group(1))
    else:
        statutes.setdefault('Enacted', []).append(block.group(1))

for status in sorted(statutes):
    print('{} ============\n{}\n'.format(status, '\n\n'.join(statutes[status])))

This groups the example text by the status of each statute (Enacted, Repealed, Transferred, etc.).

Like so:

Enacted ============
34A-6-87.1 Disposal of tire waste--Collection or processing sites--Penalties for violations.
     34A-6-87.1. Disposal of tire waste--Collection or processing sites--Penalties for violations. Any person hauling or transporting any waste tire as defined in subdivision 34A-6-61(25), originating from a wholesaler or retailer shall ensure the proper disposal of the waste tire at a department approved waste tire collection or processing site, or that it is used in some other manner approved by the department. The board may promulgate rules, pursuant to chapter 1-26, setting forth the requirements and procedures for department approval of waste tire collection, processing sites, or other approved uses for waste tires. Any waste tire hauler or transporter who intentionally disposes of any waste tire in a manner inconsistent with the provisions of this section is subject to a civil action by the State of South Dakota in circuit court for the recovery of a civil penalty of not more than ten thousand dollars per day per violation, or for costs to clean up sites not approved, or both. The violator is also subject to the criminal penalties provided for in § 34A-6-87.
Source:
  SL 1998, ch 202, § 1.          Source:

34A-6-89-34 Scale device required--Records--Report--Contents--Permit for longer capacity disposal.
     34A-6-89. Scale device required--Records--Report--Contents--Permit for longer capacity disposal. Any solid waste facility permitted to dispose of solid waste in excess of one hundred thousand tons per year shall be equipped with a scale device, approved by the Department of Public Safety, and shall weigh and maintain records of the total amount of solid waste disposed of at the facility. On or before the fifteenth of each month, the facility shall submit to the department a report upon such forms as may be prescribed by the department in rules promulgated pursuant to chapter 1-26. The report shall state the total amount of solid waste disposed of at the facility in the preceding month. The forms shall contain a sworn certification by the owner or operator that the information contained in the monthly report is true and correct based upon his own best information, knowledge, and belief. No facility may dispose of solid waste in excess of one hundred fifty thousand tons per year without a permit authorizing the capacity of the facility to dispose of solid waste in such quantities as provided in § 34A-6-1.16.
Source:
  SL 1992, ch 254, § 50Q; SL 2004, ch 17, § 231.          Source:

Repealed ============
34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

34A-1-28 Repealed.
     34A-1-28. Repealed by SL 1986, ch 295, § 7.

Transferred ============
34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-8-2A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88, Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

34A-6-88 to 23-34-1A Transferred.
     34A-6-88. Transferred to § 46A-1-83.1.

As an example of how SIMPLE your regex CAN be, at least with the example text, you can split on blank lines with Python's str.split('\n\n') and get the same result:

statutes = {}
for block in txt.split('\n\n'):
    # The status word is looked up on the first line of each block only.
    m = re.search(r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved|Rejected|Omitted|Not|Executed)', block)
    if m:
        statutes.setdefault(m.group(1), []).append(block)
    else:
        statutes.setdefault('Enacted', []).append(block)
# etc
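Here is a self-contained sketch of the split-based version that you can run as is. The function name group_by_status and the shortened sample are assumptions for illustration; the strip() call is an addition of mine so that a trailing blank line does not produce an empty block.

```python
import re

STATUS = re.compile(
    r'^.*(Superseded|Repealed|Transferred|Obsolete|Reserved'
    r'|Rejected|Omitted|Not|Executed)'
)

def group_by_status(text):
    """Split text on blank lines and group the blocks by status word."""
    statutes = {}
    for block in text.strip().split('\n\n'):
        # ^.* stays on the first line, so only the heading is inspected.
        m = STATUS.search(block)
        status = m.group(1) if m else 'Enacted'
        statutes.setdefault(status, []).append(block)
    return statutes

# Shortened sample based on the example text.
sample = (
    "34A-6-87.1 Disposal of tire waste.\n"
    "     34A-6-87.1. Disposal of tire waste.\n"
    "\n"
    "34A-6-88, Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-6-88 to 23-34-1A Transferred.\n"
    "     34A-6-88. Transferred to § 46A-1-83.1.\n"
    "\n"
    "34A-1-28 Repealed.\n"
    "     34A-1-28. Repealed by SL 1986, ch 295, § 7.\n"
)

statutes = group_by_status(sample)
for status in sorted(statutes):
    print(status, len(statutes[status]))
```

This only works if every statute in the real file is separated from the next by exactly one blank line and contains no blank lines itself, which holds for the example text but may not hold for the full scrape.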
