Trying to loop through multiple PDF files and extract text between two search criteria

Question

I am trying to look at multiple PDF files, look at the text of each, and extract paragraphs between (start) 'NOTE 1- ORGANIZATION' and 'NOTE 2- ORGANIZATION' (end). Each file has different text in this place, and I want to print each paragraph from each file, or save the paragraph to a text file.

Below, I put together a small script that opens one file, finds one string of text, and prints the page that the text is found on. This is a good start, I think, but I really want to loop through many PDF files, look for a specific body of text, and save everything that is found to a single text file.

import PyPDF2
import re

# open the pdf file
object = PyPDF2.PdfFileReader("C:/my_path/file1.pdf")

# get number of pages
NumPages = object.getNumPages()

# define keyterms
String = "New York State Real Property Law"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i)) 
    Text = PageObj.extractText() 
    # print(Text)
    ResSearch = re.search(String, Text)
    print(ResSearch)

Any insights into solving this problem are greatly appreciated!

What's your issue exactly? It sounds like you're trying to find a way to optimize / improve the processing time. If so, you may want to examine the multiprocessing module: https://docs.python.org/2/library/multiprocessing.html — jrd1, Jul 31 '18 at 20:41
No, time is not an issue. I don't care if it takes a few seconds or a few hours (I doubt it would take this long). I want to loop through multiple PDF files and extract text between a starting point and an ending point. Now, my code looks at 1 file and 1 string. I want to look at N files and 2 strings. Thanks. — ASH, Jul 31 '18 at 20:53

Me. · Accepted Answer · 2018-08-02T05:01:31.687

If your file names follow a pattern like file1.pdf, file2.pdf, and so on, you can use a for loop:

import PyPDF2
import re

for k in range(1,100):
    # open the pdf file
    object = PyPDF2.PdfFileReader("C:/my_path/file%s.pdf"%(k))

    # get number of pages
    NumPages = object.getNumPages()

    # define keyterms
    String = "New York State Real Property Law"

    # extract text and do the search
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        print("this is page " + str(i)) 
        Text = PageObj.extractText() 
        # print(Text)
        ResSearch = re.search(String, Text)
        print(ResSearch)

Otherwise, you can walk through your folder using the os module:

import PyPDF2
import re
import os

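for foldername,subfolders,files in os.walk(r"C:/my_path"):
    for file in files:
        # skip anything that is not a PDF, otherwise PdfFileReader raises an error
        if not file.lower().endswith(".pdf"):
            continue
        # open the pdf file
        object = PyPDF2.PdfFileReader(os.path.join(foldername,file))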

        # get number of pages
        NumPages = object.getNumPages()

        # define keyterms
        String = "New York State Real Property Law"

        # extract text and do the search
        for i in range(0, NumPages):
            PageObj = object.getPage(i)
            print("this is page " + str(i)) 
            Text = PageObj.extractText() 
            # print(Text)
            ResSearch = re.search(String, Text)
            print(ResSearch)

Sorry if I misunderstood your problem.

EDIT:

Unfortunately I'm not familiar with the PyPDF2 module, but it seems that when you extract the contents of a PDF with it, something odd happens to the text (such as additional newlines or formatting changes).

This page may help: Extracting text from a PDF file using Python

However, if your files were .txt, a regex would do the job:
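If the extracted text only differs from the original by extra newlines or spaces, one possible workaround is to make the anchors tolerant of whitespace. The following is a minimal sketch, not something tested against real PyPDF2 output: `anchor_pattern` is a hypothetical helper name, and the sample string is made up to imitate stray line breaks.

```python
import re

def anchor_pattern(start, end):
    """Build a regex that matches from start through end, anchors included.

    Any run of whitespace inside an anchor may match any run of whitespace
    (spaces, tabs, newlines) in the text.
    """
    def flexible(anchor):
        # Escape each word literally, then allow any whitespace between words.
        return r"\s+".join(re.escape(word) for word in anchor.split())

    # Non-greedy .*? with DOTALL lets the body span several lines.
    return re.compile(flexible(start) + r".*?" + flexible(end), re.DOTALL)

pattern = anchor_pattern("New York State Real Property Law",
                         "common elements of the property.")

# Made-up sample with a newline and a double space inside the anchors.
sample = ("Intro. New York\nState Real  Property Law applies to the common\n"
          "elements of the property. Outro.")
match = pattern.search(sample)
print(match.group(0) if match else "no match")
```

This only helps with whitespace differences. If the extraction drops or reorders characters, the anchors will still fail to match.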

import re
import os
# raw string, so that \. reaches the regex engine unchanged
myRegex=re.compile(r"New York State Real Property Law.*?common elements of the property\.",re.DOTALL)
for foldername,subfolders,files in os.walk(r"C:/Users/Mirana/Me2"):
    for file in files:
        # the with block closes each file as soon as it has been read
        with open(os.path.join(foldername,file)) as f:
            Text=f.read()
        for subText in myRegex.findall(Text):
            print(subText)
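
To save everything that is found to a single text file, as the question asks, the same idea can be wrapped in two small functions. This is a sketch under assumptions: `extract_between` and `collect_sections` are hypothetical names, the input files are assumed to be UTF-8 text, and `errors="replace"` is my choice for avoiding decode errors at the cost of substituting unreadable bytes.

```python
import os
import re

def extract_between(text, start, end):
    """Return every span of text from start through end, anchors included."""
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    return pattern.findall(text)

def collect_sections(folder, start, end, out_path):
    """Scan every .txt file under folder and write all matches to out_path.

    Returns the number of sections written.
    """
    count = 0
    out_abs = os.path.abspath(out_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for foldername, _subfolders, files in os.walk(folder):
            for name in sorted(files):
                path = os.path.join(foldername, name)
                # Only read text files, and never read the output file itself.
                if not name.lower().endswith(".txt"):
                    continue
                if os.path.abspath(path) == out_abs:
                    continue
                # Assumption: UTF-8 input; undecodable bytes are replaced.
                with open(path, encoding="utf-8", errors="replace") as handle:
                    text = handle.read()
                for section in extract_between(text, start, end):
                    # Label each section with the file it came from.
                    out.write(f"--- {name} ---\n{section}\n\n")
                    count += 1
    return count
```

For the files in the question, a call could look like `collect_sections("C:/my_path", "NOTE 1- ORGANIZATION", "NOTE 2- ORGANIZATION", "sections.txt")`, where the output file name is only an example.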

I adapted your PDF version too, but because of the problem mentioned above it doesn't work, at least for my PDFs (give it a try):

import PyPDF2
import re
import os

myRegex=re.compile(r"New York State Real Property Law.*?common elements of the property\.",re.DOTALL)
for foldername,subfolders,files in os.walk(r"C:/my_path"):
    for file in files:
        # open the pdf file
        object = PyPDF2.PdfFileReader(os.path.join(foldername,file))

        # get number of pages
        NumPages = object.getNumPages()

        # extract the text of every page and join it, so a match may span pages
        Text = ""
        for i in range(0, NumPages):
            PageObj = object.getPage(i)
            print("this is page " + str(i))
            Text += PageObj.extractText() + "\n"

        # search the whole document, not only the last page
        for subText in myRegex.findall(Text):
            print(subText)
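
A passage that starts on one page and ends on a later one can only match if the pages are searched as one string. Here is a minimal sketch of that idea, assuming `page_texts` is a list holding the already extracted text of each page (for example, what extractText() returned for every page). `find_across_pages` is a hypothetical helper, and the sample pages are invented.

```python
import re
from bisect import bisect_right

def find_across_pages(page_texts, start, end):
    """Return (page_index, section) pairs for every start...end span.

    page_index is the zero-based page on which the start anchor begins.
    """
    # Remember where each page begins inside the joined text.
    offsets = []
    position = 0
    for text in page_texts:
        offsets.append(position)
        position += len(text) + 1  # +1 for the joining newline

    full_text = "\n".join(page_texts)
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)

    results = []
    for m in pattern.finditer(full_text):
        # The last page offset that is <= the match start gives the page.
        page_index = bisect_right(offsets, m.start()) - 1
        results.append((page_index, m.group(0)))
    return results

# Invented pages: the section starts on page 1 and ends on page 2.
pages = ["cover page", "intro NOTE 1 first half", "second half NOTE 2 more"]
print(find_across_pages(pages, "NOTE 1", "NOTE 2"))
```

Joining with a newline is a design choice; it keeps words at page boundaries from running together, but it also means a page break inside an anchor would need a whitespace-tolerant pattern to match.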

That solves part of my problem. Thanks. Now, when I loop through all files, I want to pull out a string-paragraph that starts with 'New York State Real Property Law' and ends with 'common elements of the property.'. I want to print all the text between those anchors, and including those anchors. How can I do that? — ASH, Aug 01 '18 at 11:25
Yeap, that first one works! Basically, I got this error when trying to read in all PDF files: "'charmap' codec can't decode byte 0x90 in position 389: character maps to <undefined>". When I converted all PDFs to TEXT files, everything worked fine. As for the second one, all it does is print 'this is page 0'...'this is page 11'. That's fine. I have one working solution which is all I need! Thanks so much!! — ASH, Aug 02 '18 at 11:42
