Extract only specific text from PDF using Python

Question

I need to extract only specific text from invoice PDF files using Python and store the output in particular Excel columns. The PDF files all have different structures but contain the same content values.

I tried to solve it but was not able to extract only the specific text values.

Sample PDF file :

Click to view the sample file

I need to extract the Invoice ID, Issue Date, Subject, and Amount Due from the whole PDF file.

Script I have used so far:

import PyPDF2
import re
pdfFileObj = open('test.pdf','rb') 
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)         
text = str(pageObj.extractText())

quotes = re.findall(r'"[^"]*"',text)
print(quotes)

Are you trying to capture the values of Invoice ID, Issue Date, Subject, Amount Due, or just this text? — Seyi Daniel, Oct 04 '20 at 15:52
@SeyiDaniel - Yes, exactly. I am trying to extract the values for these sections from the whole PDF. — Manz, Oct 05 '20 at 06:56

D-E-N · Accepted Answer · 2020-10-05T19:36:20.663

You have a very nice pdf document, because your pdf has form fields, so you can use them directly to read the data:

import PyPDF2


pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

fields = pdfReader.getFormTextFields()

print(fields["Invoice ID"])
print(fields["Issue Date"])
print(fields["Subject"])
print(fields["Amount Due"])
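
Indexing the mapping directly raises a `KeyError` as soon as one field is missing. As a minimal sketch, assuming the form fields have already been read into a plain dict (or replaced by `None` when the PDF has no form at all), a small helper can return `None` for anything that is absent. The name `pick_fields` and the `WANTED` tuple are hypothetical and not part of PyPDF2:

```python
from typing import Dict, Iterable, Optional

# Field names taken from the sample invoice; adjust them for other forms.
WANTED = ("Invoice ID", "Issue Date", "Subject", "Amount Due")


def pick_fields(fields: Optional[Dict[str, str]],
                wanted: Iterable[str] = WANTED) -> Dict[str, Optional[str]]:
    """Return only the wanted fields; missing ones map to None.

    Hypothetical helper: `fields` is assumed to be a plain dict of
    field name -> value, or None if no form data was available.
    """
    fields = fields or {}
    return {name: fields.get(name) for name in wanted}
```

Calling `pick_fields(fields)` in place of the four `print` lookups keeps the script running on a partially filled form, and the caller can then decide how to treat the `None` values.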

EDIT: I combined your requested data (from here: How to extract only specific text from PDF file using python) into a small script with 3 ways of parsing the PDF (one for each of your 3 PDFs). The problem is that your PDFs differ a lot, and each package works better on some PDFs than on others, so I think you have to combine them. The idea is to try each function in turn until one of them returns a result. I hope this is a good start for you.

You may have to change the regexes if you have more PDF variants. You may also want to store all regexes (per field) in a list and reuse them across the functions, so that you end up with 3 parsing functions and 4 lists of regexes that are used by 2 of the functions.

import PyPDF2
import re
import os

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


def parse_pdf_by_regex_2(filename: str) -> dict:
    output_string = StringIO()
    with open(filename, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    # The dot in "No." is escaped so that it only matches a literal dot
    regex_invoice_no = re.compile(r"Invoice No\.:\s*(\w+)\s")
    regex_order_no = re.compile(r"IRN:\s*(\d+)")
    regex_due_date = re.compile(r"Due Date: (\d{2}\.\d{2}\.\d{4})")
    regex_total_due = re.compile(r"([\d,.]+) \n\nTotal Invoice Value\(in words\)")

    try:
        return {"invoice_id": re.search(regex_invoice_no, output_string.getvalue()).group(1),
                "issue_date": re.search(regex_due_date, output_string.getvalue()).group(1),
                "subject": re.search(regex_order_no, output_string.getvalue()).group(1),
                "amount": re.search(regex_total_due, output_string.getvalue()).group(1)}

    except AttributeError as err:
        print("Not all elements have been found")
        return {}


def parse_pdf_by_form_fields(filename: str) -> dict:
    with open(filename, 'rb') as file:
        pdf_reader = PyPDF2.PdfFileReader(file)
        try:
            fields = pdf_reader.getFormTextFields()
        except TypeError as err:
            # print("No FormFields available")
            return {}

    try:
        # You could also check whether only some values are missing; whether that can happen depends on your data
        return {"invoice_id": fields["Invoice ID"],
                "issue_date": fields["Issue Date"],
                "subject": fields["Subject"],
                "amount": fields["Amount Due"]}
    except KeyError as err:
        # print(f"Key not found: '{err.args[0]}'")
        return {}


def parse_pdf_by_regex(filename: str) -> dict:
    with open(filename, 'rb') as file:
        pdf_reader = PyPDF2.PdfFileReader(file)
        text_data = ""
        for page_no in range(pdf_reader.getNumPages()):
            text_data += pdf_reader.getPage(page_no).extractText()

    regex_invoice_no = re.compile(r"Invoice Number\s*(INV-\d+)")
    regex_order_no = re.compile(r"Order Number(\d+)")
    regex_due_date = re.compile(r"Due Date(\S+ \d{1,2}, \d{4})")
    regex_total_due = re.compile(r"Total Due(\$\d+\.\d{1,2})")

    try:
        return {"invoice_id": re.search(regex_invoice_no, text_data).group(1),
                "issue_date": re.search(regex_due_date, text_data).group(1),
                "subject": re.search(regex_order_no, text_data).group(1),
                "amount": re.search(regex_total_due, text_data).group(1)}

    except AttributeError as err:
        # print("Not all elements have been found")
        return {}


def parse_pdf(filename: str) -> dict:
    # Hint: ':=' is available since Python 3.8
    # Use the 'filename' parameter here, not the global loop variable 'fname'
    if data := parse_pdf_by_form_fields(filename=filename):
        return data
    elif data := parse_pdf_by_regex(filename=filename):
        return data
    elif data := parse_pdf_by_regex_2(filename=filename):
        return data
    else:
        print("No data found")
        return {}


if __name__ == '__main__':
    for fname in os.listdir("."):
        if fname.startswith("testfile"):
            print(f"check {fname}")
            print(parse_pdf(filename=fname))
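
The suggestion above to keep several regexes per field could look roughly like the following. This is only a sketch under assumptions: it reuses the patterns from the two regex functions, works on text that has already been extracted, and the names `FIELD_PATTERNS`, `first_match` and `extract_fields` are made up for illustration.

```python
import re
from typing import Dict, List, Optional, Pattern

# One list of candidate patterns per field, tried in order.
FIELD_PATTERNS: Dict[str, List[Pattern[str]]] = {
    "invoice_id": [re.compile(r"Invoice Number\s*(INV-\d+)"),
                   re.compile(r"Invoice No\.:\s*(\w+)\s")],
    "issue_date": [re.compile(r"Due Date(\S+ \d{1,2}, \d{4})"),
                   re.compile(r"Due Date: (\d{2}\.\d{2}\.\d{4})")],
    "subject": [re.compile(r"Order Number(\d+)"),
                re.compile(r"IRN:\s*(\d+)")],
    "amount": [re.compile(r"Total Due(\$\d+\.\d{1,2})"),
               re.compile(r"([\d,.]+) \n\nTotal Invoice Value\(in words\)")],
}


def first_match(text: str, patterns: List[Pattern[str]]) -> Optional[str]:
    """Return group 1 of the first pattern that matches, else None."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def extract_fields(text: str) -> Dict[str, str]:
    """Return all four fields, or {} if any field cannot be found."""
    result: Dict[str, str] = {}
    for name, patterns in FIELD_PATTERNS.items():
        value = first_match(text, patterns)
        if value is None:
            return {}
        result[name] = value
    return result
```

With this layout, supporting a new PDF variant means appending a pattern to the relevant lists, and both regex-based functions could pass their extracted text to `extract_fields` instead of carrying their own copies of the patterns.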

It's working with this PDF, but when we try to capture the same fields from another PDF (with the data in a different format) it gives the error: TypeError: 'NoneType' object is not iterable. How can we overcome this error? — Manz, Oct 05 '20 at 09:51
- How should we deal with the scenario where the data comes in PDF files with different formats? I tried to use regex to find the values, but because there are no spaces between the particular texts I couldn't capture the data. — Manz, Oct 05 '20 at 11:38
We cannot help if you ask a question with one PDF and have a problem with another PDF that you are not providing here. You have to share that one too. (Did you mean your crosspost https://stackoverflow.com/questions/64142307/how-to-extract-only-specific-text-from-pdf-file-using-python/64158092) — D-E-N, Oct 05 '20 at 16:46
Yes, I provided the sample PDF file link at https://stackoverflow.com/questions/64142307/how-to-extract-only-specific-text-from-pdf-file-using-python/64158092 , you can have a look at it. — Manz, Oct 05 '20 at 18:35
Thanks for the valuable answer, but as I am new to this language I have a question: how and where should we add the name of the PDF file in the updated script? — Manz, Oct 06 '20 at 04:58
The line starting with `if __name__` is the starting point of the script. It iterates over the files in the current directory `.` and checks whether the filename starts with `testfile`, because I saved your 3 files with such names. The call of the function with the filename as parameter is the last line, so you can call the function with `parse_pdf()`. This function uses the other functions to try to parse the file. — D-E-N, Oct 06 '20 at 09:34
Thanks for the feedback, but when using the above `if __name__ == '__main__':` block I get this error: "FileNotFoundError: [Errno 2] No such file or directory: 'testfile.pdf'". I am using Python 3.8. When I use the filename directly it works. — Manz, Oct 06 '20 at 18:49
The `main` part is for reading files lying next to the script, but if you get it done by naming them explicitly, all is fine. — D-E-N, Oct 06 '20 at 19:05
