I wrote the following Python script as a reproducible example of my current PDF-parsing hang-up. It:

  • downloads a pdf transcript from the web (Cassidy Hutchinson's 9/14/2022 interview transcript with the J6C)
  • reads/OCRs that pdf to text
  • attempts to split that text into the series of Q&A passages from the interview
  • runs a series of tests I wrote based on my manual read of the transcript

Running the Python code below generates the following output:

$ python stack_overflow_q_example.py
Test for passage0 passed.
Test for passage1 passed.
Test for passage7 passed.
Test for passage8 passed.
Traceback (most recent call last):
  File "/home/max/askliz/stack_overflow_q_example.py", line 91, in <module>
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg
AssertionError: Failed on passage 10

Your mission, should you choose to accept it: get this passage10 test to pass without breaking one of the previous tests. I'm hoping there's a clever regex or other modification in extract_q_a_locations below that will do the trick, but I'm open to any solution that passes all these tests, as I chose these test passages deliberately.

A little background on this transcript text, in case it's not as fun reading to you as it is to me: Sometimes a passage starts with a "Q" or "A", and sometimes it starts with a name (e.g. "Ms. Cheney."). The test that's failing, for passage 10, is where a question is asked by a staff member whose name is then redacted. The only way I've managed to get that test to pass has inadvertently broken one of the other tests, because not all redactions indicate the start of a question. (Note: in the PDF text-extraction library I'm using, pdfplumber, redacted text usually shows up as just a bunch of extra spaces.)

Code below:

import nltk
import re
import requests
import pdfplumber


def extract_q_a_locations(examination_text:str)->list:

    # (when parsed by pdfplumber) every Q/A starts with a newline, then spaces, 
    # then a line number and more spaces 
    prefix_regex = r'\n\s+\d+\s+'

    # sometimes what comes next is a 'Q' or 'A' and more spaces
    qa_regex = r'[QA]\s+'

    # other times what comes next is the name of a congressperson or lawyer for the witness
    speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"

    # the combined regex I've been using is looking for the prefix then QA or Speaker regex
    pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"
    delims = list(re.finditer(pattern, examination_text))
    return delims

def get_q_a_passages(qa_delimiters, text):
    q_a_list = []
    for delim, next_delim in zip(qa_delimiters[:-1], qa_delimiters[1:]):
        # prefix is either 'Q', 'A', or the name of the speaker
        prefix = text[delim.span()[0]:delim.span()[1]].strip().split()[-1]

        # the text chunk is the actual dialogue text. everything from current delim to next one
        text_chunk = text[delim.span()[1]:next_delim.span()[0]]
        
        # now we want to remove some of the extra cruft from layout=True OCR in pdfplumber
        text_chunk = re.sub(r"\n\s+\d+\s+", " ", text_chunk)  # remove line numbers
        text_chunk = " ".join(text_chunk.split())            # remove extra whitespace
        
        q_a_list.append(f"{prefix} {text_chunk}")

    return q_a_list

if __name__ == "__main__":

    # download pdf
    PDF_URL = "https://www.govinfo.gov/content/pkg/GPO-J6-TRANSCRIPT-CTRL0000928888/pdf/GPO-J6-TRANSCRIPT-CTRL0000928888.pdf"
    FILENAME = "interview_transcript_stackoverflow.pdf"

    response = requests.get(PDF_URL)
    with open(FILENAME, "wb") as f:
        f.write(response.content)

    # read pdf as text
    with pdfplumber.open(FILENAME) as pdf:
        text = "".join([p.extract_text(layout=True) for p in pdf.pages])

    # I care about the Q&A transcript, which starts after the "EXAMINATION" header
    startidx = text.find("EXAMINATION")
    text = text[startidx:]

    # extract Q&A passages
    passage_locations = extract_q_a_locations(text)
    passages = get_q_a_passages(passage_locations, text)

    # TESTS
    ACCEPTABLE_TEXT_DISCREPANCY = 2

    # The tests below all pass already.
    actual_passage0_start = "Q So I do first want to bring up exhibit"
    assert nltk.edit_distance(passages[0][:len(actual_passage0_start)], actual_passage0_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage0 passed.")

    actual_passage1 = "A This is correct."
    assert nltk.edit_distance(passages[1][:len(actual_passage1)], actual_passage1) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage1 passed.")

    # Note: for the next two passages/texts, the prefix/questioner is captured as "Cheney" &
    # "Jordan", not "Ms. Cheney" & "Mr. Jordan". I'm fine with either way.
    actual_passage7_start = "Cheney. And we also, just as" 
    assert nltk.edit_distance(passages[7][:len(actual_passage7_start)], actual_passage7_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage7 passed.")

    actual_passage8_start = "Jordan. They are pro bono"
    assert nltk.edit_distance(passages[8][:len(actual_passage8_start)], actual_passage8_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage8 passed.")

    # HERE'S MY PROBLEM. 
    # This test fails because my regex fails to capture the question which starts with the 
    # redacted name of the staff/questioner. The only way I've managed to get this test to 
    # pass has also broken at least one of the tests above. 
    actual_passage10_start = " So at this point, as we discussed earlier, I'm going to"
    e_msg = "Failed on passage 10"
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg
Max Power
  • I suspect it's not really possible to do this unless you can find a way to detect the redactions. For example, page 5, line 8 has a sentence which begins "for the committee." That text is in the same horizontal position as the first sentence said by a new speaker. The only way to tell that it isn't a new speaker is either 1) context or 2) seeing that the redaction covered the un-indented part of the line, and therefore the original line couldn't have been indented. – Nick ODell Feb 22 '23 at 06:35
  • If pdfplumber replaces redacted portions with spaces, then you can distinguish between 1) start of a new passage with a redacted name, and 2) continuation of current passage by starting with a new sentence that is indented. – EricC Feb 22 '23 at 07:04
  • For example, if the indentation is consistently 4 spaces, we could use that format (contextual info) to make `prefix_regex` more specific: `\n\s+\d+\s{4}`. Then `(?:{speaker_regex}|{qa_regex})` can be changed to `(?:{speaker_regex}|{qa_regex}|\s+)` to account for passages that start with redacted text. – EricC Feb 22 '23 at 07:13
  • Hi Nick, yeah your page 5 line 8 example is actually exactly what I had in mind when I wrote "The only way I've managed to get [passage10's] test to pass has inadvertently broken one of the other tests, because not all redactions indicate the start of a question." So I agree identifying redactions within the OCR'd text is probably part of the solution. I did include the location of the raw pdf and code to download the raw pdf, because I don't take for granted I'm going from pdf->text string in the best way, given there are important redactions. – Max Power Feb 22 '23 at 16:37
  • Hi EricC, thanks for the suggestion. However, if I understand you correctly, your proposed solution will run into the issue Nick and I have each encountered: it may correctly identify e.g. page 6 line 10 as the start of a question, but it will also incorrectly flag page 5 line 8 as the start of a question, because not all lines starting with a redaction indicate a question. So it will not pass all the tests together. – Max Power Feb 22 '23 at 16:40

1 Answer


I have assumed that redactions in the middle of a passage do not need to be handled. What I have done is replace the run of spaces left by a redacted name with the placeholder "Ms. Fakename.". I did this because, as you mentioned in your question, every passage of interest starts with either a name or a Q or A. When a passage starts with a name, the name ends with a period and the text that follows starts with a capital letter. When the name is redacted and the line starts a new passage, the line number is followed by a long run of spaces and then a capital letter. Combining these observations, I was able to get all the tests passing by adding the following snippet:

    lines = text.splitlines()

    for i in range(len(lines)):
        if re.fullmatch(r" {10,}\d{1,2} {15,}[A-Z].+", lines[i]):
            lines[i] = re.sub(r" {15,}", "       Ms. Fakename. ", lines[i], count=1)
    
    text = "\n".join(lines)
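
To see what this rule does in isolation, here is a minimal sketch on synthetic lines. The sample lines are invented to mimic the layout described in the question (they are not copied from the transcript), and `mark_redacted_speakers` is a hypothetical helper name used only for this illustration.

```python
import re

# Same line rule as the snippet above: 10+ leading spaces, a 1-2 digit line
# number, then a gap of 15+ spaces followed by a capital letter.
REDACTED_SPEAKER_LINE = re.compile(r" {10,}\d{1,2} {15,}[A-Z].+")


def mark_redacted_speakers(text: str, placeholder: str = "Ms. Fakename.") -> str:
    """Insert a placeholder speaker name on lines that match the rule."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if REDACTED_SPEAKER_LINE.fullmatch(line):
            # Replace only the first run of 15+ spaces (the redaction gap).
            lines[i] = re.sub(r" {15,}", f"       {placeholder} ", line, count=1)
    return "\n".join(lines)


# Synthetic lines (assumed layout): 12 leading spaces, then the line number.
sample = "\n".join([
    " " * 12 + "9" + " " * 7 + "Q" + " " * 4 + "Is that correct?",  # ordinary Q line
    " " * 12 + "10" + " " * 20 + "So at this point, I'm going to",  # gap + capital
    " " * 12 + "11" + " " * 20 + "for the committee.",              # gap + lowercase
    " " * 12 + "12" + " " * 4 + "turn it over.",                    # plain continuation
])

print(mark_redacted_speakers(sample))
```

Only the second sample line is rewritten; the line with a gap followed by a lowercase letter is left alone. The capital-letter check is a heuristic: a continuation line that follows a mid-passage redaction and happens to begin with a capital letter would also receive the placeholder.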

The final code is:

import nltk
import re
import requests
import pdfplumber


def extract_q_a_locations(examination_text:str)->list:

    # (when parsed by pdfplumber) every Q/A starts with a newline, then spaces, 
    # then a line number and more spaces 
    prefix_regex = r'\n\s+\d+\s+'

    # sometimes what comes next is a 'Q' or 'A' and more spaces
    qa_regex = r'[QA]\s+'

    # other times what comes next is the name of a congressperson or lawyer for the witness
    speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"

    # the combined regex I've been using is looking for the prefix then QA or Speaker regex
    pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"
    delims = list(re.finditer(pattern, examination_text))
    return delims

def get_q_a_passages(qa_delimiters, text):
    q_a_list = []
    for delim, next_delim in zip(qa_delimiters[:-1], qa_delimiters[1:]):
        # prefix is either 'Q', 'A', or the name of the speaker
        prefix = text[delim.span()[0]:delim.span()[1]].strip().split()[-1]

        # the text chunk is the actual dialogue text. everything from current delim to next one
        text_chunk = text[delim.span()[1]:next_delim.span()[0]]
        
        # now we want to remove some of the extra cruft from layout=True OCR in pdfplumber
        text_chunk = re.sub(r"\n\s+\d+\s+", " ", text_chunk)  # remove line numbers
        text_chunk = " ".join(text_chunk.split())            # remove extra whitespace
        
        q_a_list.append(f"{prefix} {text_chunk}")

    return q_a_list

if __name__ == "__main__":

    # download pdf
    PDF_URL = "https://www.govinfo.gov/content/pkg/GPO-J6-TRANSCRIPT-CTRL0000928888/pdf/GPO-J6-TRANSCRIPT-CTRL0000928888.pdf"
    FILENAME = "interview_transcript_stackoverflow.pdf"

    response = requests.get(PDF_URL)
    with open(FILENAME, "wb") as f:
        f.write(response.content)

    # read pdf as text
    with pdfplumber.open(FILENAME) as pdf:
        text = "".join([p.extract_text(layout=True) for p in pdf.pages])
    
    lines = text.splitlines()

    for i in range(len(lines)):
        if re.fullmatch(r" {10,}\d{1,2} {15,}[A-Z].+", lines[i]):
            lines[i] = re.sub(r" {15,}", "       Ms. Fakename. ", lines[i], count=1)
    
    text = "\n".join(lines)

    # I care about the Q&A transcript, which starts after the "EXAMINATION" header
    startidx = text.find("EXAMINATION")
    text = text[startidx:]

    # extract Q&A passages
    passage_locations = extract_q_a_locations(text)
    passages = get_q_a_passages(passage_locations, text)

    # TESTS
    ACCEPTABLE_TEXT_DISCREPANCY = 2

    # The tests below all pass already.
    actual_passage0_start = "Q So I do first want to bring up exhibit"
    assert nltk.edit_distance(passages[0][:len(actual_passage0_start)], actual_passage0_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage0 passed.")

    actual_passage1 = "A This is correct."
    assert nltk.edit_distance(passages[1][:len(actual_passage1)], actual_passage1) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage1 passed.")

    # Note: for the next two passages/texts, the prefix/questioner is captured as "Cheney" &
    # "Jordan", not "Ms. Cheney" & "Mr. Jordan". I'm fine with either way.
    actual_passage7_start = "Cheney. And we also, just as" 
    assert nltk.edit_distance(passages[7][:len(actual_passage7_start)], actual_passage7_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage7 passed.")

    actual_passage8_start = "Jordan. They are pro bono"
    assert nltk.edit_distance(passages[8][:len(actual_passage8_start)], actual_passage8_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage8 passed.")

    # This was the failing test in the question: the passage starts with the redacted
    # name of the staff/questioner. With the placeholder name inserted above, the regex
    # now captures it, so the expected text below includes the "Fakename" prefix.
    actual_passage10_start = "Fakename So at this point, as we discussed earlier, I'm going to"
    e_msg = "Failed on passage 10"
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg

Note that in the last test, I added "Fakename" as the prefix. If this is not desired, the passages list can be updated to remove the manually added prefix.
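
A minimal sketch of that clean-up step, assuming the prefix comes out as `Fakename.` (which is what `get_q_a_passages` produces when the matched delimiter ends in `Ms. Fakename.`). The helper name `strip_placeholder` and the sample passages are hypothetical.

```python
# Assumed prefix: the last whitespace-separated token of "Ms. Fakename."
PLACEHOLDER_PREFIX = "Fakename."


def strip_placeholder(passages, placeholder=PLACEHOLDER_PREFIX):
    """Remove the placeholder speaker prefix, leaving other passages untouched."""
    cleaned = []
    for passage in passages:
        if passage.startswith(placeholder):
            # Drop only the placeholder; the separating space is kept, so the
            # result starts with " " like the question's original expected text.
            passage = passage[len(placeholder):]
        cleaned.append(passage)
    return cleaned


example = ["Q So I do first want to bring up exhibit", "Fakename. So at this point"]
print(strip_placeholder(example))
```

Keeping the leading space means the cleaned passage lines up with the `actual_passage10_start` string from the question, which begins with a space.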

Samkit Jain
  • Hey thanks Samkit, this is awesome! After confirming I can reproduce all tests passing with your code, I went through each passage one by one and read it side by side with the pdf transcript. From page 5 (where "EXAMINATION"/the interview starts) through page 11, so far every single passage delineation has been identified correctly. Stack Overflow says I can't award the bounty for 13 hours but I will do so then, unless a new answer emerges by then with some sort of improvement I am not anticipating. – Max Power Feb 24 '23 at 16:40
  • Hi @MaxPower Glad the answer could be of help :) – Samkit Jain Feb 26 '23 at 12:02