PDF Parsing a sentence across multiple Lines

Question

Goal: if a PDF line contains a sub-string, copy the entire sentence it belongs to (which may span multiple lines).

I am able to print() the line the phrase appears in.

Now, once I find this line, I want to iterate backwards until I find a sentence terminator (. ! ?) from the previous sentence, and then iterate forwards until the next sentence terminator.

This is so that I can print() the entire sentence the phrase belongs to.

Jupyter Notebook:

# pip install PyPDF2
# pip install pdfplumber

# ---
# import re
import glob
import PyPDF2
import pdfplumber

# ---
phrase = "Responsible Care Company"
# SENTENCE_REGEX = re.compile(r'^[A-Z][^?!.]*[?.!]$')

def scrape_sentence(sentence, lines, index):
    if '.' in lines[index] or '!' in lines[index] or '?' in lines[index]:
        return sentence.replace('\n', '').strip()
    sentence = scrape_sentence(lines[index-1] + sentence, lines, index-1)  # previous line
    sentence = scrape_sentence(sentence + lines[index+1], lines, index+1)  # following line
    return sentence
    
# ---    
    
with pdfplumber.open('../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
                sentence = scrape_sentence('', lines, i)  # !
                print(sentence)  # !
            i += 1

Output:

connection and the linkage to the relevant UN’s 17 SDGs.and Leadership. We have long realized and recognized that there

Phrase:

Responsible Care Company

Sentence (across multiple lines):

"GPIC is a Responsible Care Company certified for RC 14001 
since July 2010."

PDF (pg. 2).

I have been working on "back-tracking" iterations, based on this solution. I did try a for-loop, but it doesn't let you go back iterations.

Regex sentence added

Please let me know if there is anything else I can add to post.

If you have a new question, don't edit your question to change it, as this invalidates any answers that may have been posted. Instead, [post a new question](https://stackoverflow.com/questions/ask) — Zoe, Nov 29 '21 at 13:36

score -1 · Answer 1 · answered Nov 29 '21 at 12:45

The error you are getting is caused by your code attempting to modify an object of type None.

To fix this there are two options. The first is to wrap the split operation in an if statement:

for page in opened_pdf.pages:
    text = page.extract_text()
    if text is not None:
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
                sentence = scrape_sentence('', lines, i)
                print(sentence)
            i += 1

Or you could use a continue statement to skip the rest of the loop:

for page in opened_pdf.pages:
    text = page.extract_text()
    if text is None:
        continue
    lines = text.split('\n')
    i = 0
    sentence = ''
    while i < len(lines):
        if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
            sentence = scrape_sentence('', lines, i)  # !
            print(sentence)  # !
        i += 1

Have you considered tracing it twice? Once to find the position of the sentence and once to find the sentence terminators nearest to it? — Chris, Nov 29 '21 at 13:41
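
A minimal sketch of the two-pass idea from the comment above, assuming the page's lines are first joined into one string: the first pass locates the phrase, the second scans outward for the nearest terminators. The function name `sentence_around` and the sample lines are illustrative, not from the original post, and a '.' inside an abbreviation or a number would still be treated as a terminator.

```python
TERMINATORS = ".!?"

def sentence_around(lines, phrase):
    """Return the sentence containing `phrase`, or None if it is absent."""
    # Join with spaces so words at line ends do not run together.
    text = " ".join(line.strip() for line in lines)

    # Pass 1: locate the phrase.
    pos = text.find(phrase)
    if pos == -1:
        return None

    # Pass 2a: nearest terminator before the phrase (rfind gives -1 if none, so start = 0).
    start = max(text.rfind(t, 0, pos) for t in TERMINATORS) + 1

    # Pass 2b: nearest terminator after the phrase (end of text if none).
    ends = [i for i in (text.find(t, pos + len(phrase)) for t in TERMINATORS) if i != -1]
    end = min(ends) + 1 if ends else len(text)

    return text[start:end].strip()

lines = [
    "connection and the linkage to the relevant SDGs.",
    "GPIC is a Responsible Care Company certified for RC 14001 ",
    "since July 2010. We have long realized that there",
]
print(sentence_around(lines, "Responsible Care Company"))
```

Because the search runs on the joined text rather than line by line, a phrase that itself wraps across two lines can still be found.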

score -1 · Accepted Answer · answered Nov 30 '21 at 10:02

I have a working version. However, this does not account for multiple columns of text from a .pdf page.

See here for a discussion related to that.

Example .pdf

Jupyter Notebook:

# pip install PyPDF2
# pip install pdfplumber

# ---

import glob
import PyPDF2
import pdfplumber

# ---

def scrape_sentence(phrase, lines, index):
    # -- Gather sentence 'phrase' occurs in --
    sentence = lines[index]
    print("-- sentence --", sentence)
    print("len(lines)", len(lines))
    
    # Previous lines
    pre_i, flag = index, 0
    while flag == 0:
        pre_i -= 1
        if pre_i <= 0:  # note: stops at index 0, so lines[0] is never prepended
            break
            
        sentence = lines[pre_i] + sentence
        
        if '.' in lines[pre_i] or '!' in lines[pre_i] or '?' in lines[pre_i] or '  •  ' in lines[pre_i]:
            flag = 1  # terminator found: stop scanning backwards
    
    print("\n", sentence)
    
    # Following lines
    post_i, flag = index, 0
    while flag == 0:
        post_i += 1
        if post_i >= len(lines):
            break
            
        sentence = sentence + lines[post_i] 
        
        if '.' in lines[post_i] or '!' in lines[post_i] or '?' in lines[post_i] or '  •  ' in lines[post_i]:
            flag = 1  # terminator found: stop scanning forwards
    
    print("\n", sentence)
    
    # -- Extract --
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    print(sentence)
    sentence = sentence[0].replace('\n', '').strip()  # first occurrence
    print(sentence)
    
    return sentence

# ---

phrase = 'Global Reporting Initiative'

with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        if text is None:
            continue
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if phrase in lines[i]:
                sentence = scrape_sentence(phrase, lines, i)
            i += 1

Output:

-- sentence -- 2016 Global Reporting Initiative (GRI) Report
len(lines) 7

 2016 Global Reporting Initiative (GRI) Report

 2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
['2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01']
2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01

...
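
In the output above the lines run together ("ReportIncluding:") because they are concatenated without a separator, and '!' and '?' are lost when they are replaced by '.'. A possible alternative, sketched here under the assumption that a sentence ends at '.', '!' or '?' followed by whitespace, joins the lines with spaces and splits with a regular expression so that each sentence keeps its own terminator. The name `sentences_with` and the sample lines are hypothetical.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace (a simplifying
# assumption; abbreviations such as "e.g. " would be split too).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_with(lines, phrase):
    """Return every sentence that contains `phrase`, terminators preserved."""
    # Collapse line breaks and repeated whitespace into single spaces.
    text = " ".join(" ".join(lines).split())
    return [s for s in SENTENCE_END.split(text) if phrase in s]

lines = [
    "Is this a report? Yes! It follows the Global",
    "Reporting Initiative guidelines. The end.",
]
print(sentences_with(lines, "Global Reporting Initiative"))
```

This returns a list rather than only the first match, so a page that mentions the phrase several times yields every matching sentence.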
