Cleaning up a Table of Contents to extract just the Titles using Python?

Question

I'm working on an academic research project that requires extracting titles from a Table of Contents. I'm making a Python program to clean up text that looks like this:

BONDS OF LATE:
An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79
An act to provide for publishing a now edition of Dresses Reports ..................................... 78

BRIDGES:
An act to provide for the better protection of the public bridges in this State ........................... 74

to look like this:

An act providing the officers of the State of Illinois from making payments on certain bonds .

An act to provide for publishing a now edition of Dresses Reports .

An act to provide for the better protection of the public bridges in this State .

My strategy is to somehow iterate through a text file and delete characters after the first '.' and before the next 'An act'. I thought about trying a nested 'for' loop like this:

for line in file:
    for character in line:

But iterating by character makes it impossible to stop at a string (i.e. 'An act'). I'm a beginner to Python (and coding) and would greatly appreciate any help. Are there regular expressions that would help delete all the characters in a line before 'An act' and after the first period? Thank you!

To clarify, are you trying to capture all lines that start with the phrase "An act"? If so, I think your formatting of the text might be off. Make sure you're starting each new line in the text with `>` in Markdown. – BrokenBenchmark, Jun 18 '22 at 15:50
That's not how it appears in the post. Can you double-check the formatted version of your question? – BrokenBenchmark, Jun 18 '22 at 16:10
There are no lines in the question that start with "An act". I took a look at your post and it looks like there are soft line breaks. See the [second example here](https://gist.github.com/shaunlebron/746476e6e7a4d698b373). I don't see how "the formatted version looks good to me" is consistent with "I want " when none of the lines start with "An act" in the question. – BrokenBenchmark, Jun 18 '22 at 16:19
The formatting is now fixed. To clarify, category titles (i.e. "BONDS OF LATE:") are irregularly interspersed and need to be removed. — fjturner, Jun 18 '22 at 18:01

BrokenBenchmark · Accepted Answer · 2022-06-18T18:24:29.397

1

You can use a regular expression that matches lines that start with "An act", followed by a space and at least one character, followed by a period (see this regex101 for a more in-depth explanation). The non-greedy quantifier `+?` makes the match stop at the first period, and `?:` marks a group that we don't need to capture:

import re

with open("data.txt") as file:
    for line in file:
        search_result = re.search(r"^(An act (?:.+?)\.)", line)
        if search_result:
            print(search_result.group(1))

This outputs:

An act providing the officers of the State of Illinois from making payments on certain bonds .
An act to provide for publishing a now edition of Dresses Reports .
An act to provide for the better protection of the public bridges in this State .
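
To see why the non-greedy quantifier matters here, the following small sketch runs the answer's pattern and a greedy variant against one sample line held in a string (no file needed). The greedy variant is only for comparison and is not part of the answer.

```python
import re

line = ("An act to provide for the better protection of the public "
        "bridges in this State ........................... 74")

# Non-greedy: .+? expands only as far as needed, so the match
# ends at the first period in the line.
lazy = re.search(r"^(An act (?:.+?)\.)", line)

# Greedy (for comparison only): .+ grabs as much as it can, so the
# match ends at the last period and keeps the whole run of dots.
greedy = re.search(r"^(An act (?:.+)\.)", line)

print(lazy.group(1))
print(greedy.group(1))

# A category heading does not start with "An act", so it is skipped.
print(re.search(r"^(An act (?:.+?)\.)", "BRIDGES:"))
```

With the greedy form the leader dots stay in the result, which is why the answer uses `.+?`.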

edited Jun 18 '22 at 18:24

answered Jun 18 '22 at 18:18


Thank you! Instead of printing the result, is it possible to make it so the program writes to a new .txt file? – fjturner Jun 18 '22 at 18:46
Yes, absolutely -- open a second file for writing, and then replace the print with a call to `.write()`. – BrokenBenchmark Jun 18 '22 at 23:40
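
A minimal sketch of what that comment describes, assuming the same pattern as in the answer. The function name `extract_titles` and the output file name used below are hypothetical, chosen only for illustration.

```python
import re

# Same pattern as in the accepted answer.
TITLE_PATTERN = re.compile(r"^(An act (?:.+?)\.)")


def extract_titles(in_path, out_path):
    """Copy every matched title from in_path to out_path, one per line.

    Returns the number of titles written.
    """
    count = 0
    # Open the input for reading and a second file for writing.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            match = TITLE_PATTERN.search(line)
            if match:
                # Unlike print(), write() does not append a newline.
                dst.write(match.group(1) + "\n")
                count += 1
    return count
```

Calling `extract_titles("data.txt", "titles.txt")` would then write the matched titles to `titles.txt`, overwriting that file if it already exists.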

score 1 · Answer 2 · answered Jun 18 '22 at 18:29

A solution using regex and `str.replace`:

>>> import re 
>>> lines="""
... BONDS OF LATE:
... An act providing the officers of the State of Illinois from making payments on certain bonds ............ 79
... An act to provide for publishing a now edition of Dresses Reports  ..................................... 78
... 
... BRIDGES:
... An act to provide for the better protection of the public bridges in this State ........................... 74
... """

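>>> m = re.sub(r'\b[A-Z]+\b', '', lines)  # drop all-capitals words (the category headings)
>>> m = m.replace(":", "")                 # drop the colons left behind by the headings
>>> m = m.replace(".", "")                 # drop the leader dots
>>> m = ''.join(i for i in m if not i.isdigit())  # drop the page numbers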

>>> print(m)

An act providing the officers of the State of Illinois from making payments on certain bonds  
An act to provide for publishing a now edition of Dresses Reports  

An act to provide for the better protection of the public bridges in this State

Adapted from here
