How to extract only specific text from PDF file using python

Question

How to extract some of the specific text only from PDF files using python and store the output data into particular columns of Excel.

Here is the sample input PDF file (File.pdf)

Link to the full PDF file File.pdf

We need to extract the value of Invoice Number, Due Date and Total Due from the whole PDF file.

Script i have used so far:

    from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('file.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

But not getting the specific output value from the PDF file .

can you provide en example pdf file (as pdf, not as picture) — D-E-N, Sep 30 '20 at 17:59

D-E-N · Answer 1 · 2020-10-03T10:12:30.513

If you want to find the data in in your way (pdfminer), you can search for a pattern to extract the data like the following (new is the regex at the end, based on your given data):

from io import StringIO
import re

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('testfile.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

finding = re.search(r"INV-\d+\n\d+\n.+\n.+\n\$\d+\.\d+", output_string.getvalue())

invoice_no, order_no, _, due_date, total_due = finding.group(0).split("\n")

print(invoice_no, order_no, due_date, total_due)

If you want to store the data in excel, you may have to be more specific (or open a new question) or look on these pages:

Writing to an Excel spreadsheet

https://www.geeksforgeeks.org/writing-excel-sheet-using-python/

https://xlsxwriter.readthedocs.io/

PS: the other answer looks like a good solution, you only have to filter the data

EDIT: Second solution. Here I use another package PyPDF2, because there you get the data in an other order (maybe this is possible with PDFMiner, too). If the text before the values are always the same, you can find the data like this:

import re
import PyPDF2

def parse_pdf() -> list:
    with open("testfile.pdf", "rb") as file:
        fr = PyPDF2.PdfFileReader(file)
        data = fr.getPage(0).extractText()

    regex_invoice_no = re.compile(r"Invoice Number\s*(INV-\d+)")
    regex_order_no = re.compile(r"Order Number(\d+)")
    regex_invoice_date = re.compile(r"Invoice Date(\S+ \d{1,2}, \d{4})")
    regex_due_date = re.compile(r"Due Date(\S+ \d{1,2}, \d{4})")
    regex_total_due = re.compile(r"Total Due(\$\d+\.\d{1,2})")

    invoice_no = re.search(regex_invoice_no, data).group(1)
    order_no = re.search(regex_order_no, data).group(1)
    invoice_date = re.search(regex_invoice_date, data).group(1)
    due_date = re.search(regex_due_date, data).group(1)
    total_due = re.search(regex_total_due, data).group(1)

    return [invoice_no, due_date, total_due]


if __name__ == '__main__':
    print(parse_pdf())

Maybe you have to change the regexes, because they are only based on the given example. The regexes are only working if they find the regex, so you have to work with try: except per regex ;)
If this does not answer your question, you have to provide more information/example pdfs.

Thanks for the answer. but how we can use it in the case where the values which need to be extracted are not in sequence i.e at different positions in the PDF. and also is there any way to append all the values in a list. — Manz, Oct 03 '20 at 02:07
Its working, but couldn't extract the data when the format of PDF's are different, Trying to use the regex function but not sure whether Regex function will work or not if yes how it can be done. when tried to use the above script getting the error "AttributeError: 'NoneType' object has no attribute 'group'" — Manz, Oct 05 '20 at 14:28
Please find the link of the related sample PDF File - https://drive.google.com/file/d/1Kmg5QYQrnnhZ05wsGA0jBkCC1FJOUfbd/view from which the same data need to be extracted. — Manz, Oct 05 '20 at 18:33

score 0 · Answer 2 · answered Oct 01 '20 at 03:22

0

You can extract data using tabula and using that data you can create an excel file using python:

df = ("./Downloads/folder/myfile.pdf")
output = "./Downloads/folder/test.csv"
tabula.convert_into(df, output, output_format="csv", stream=True)

excel file creation: https://www.geeksforgeeks.org/python-create-and-write-on-excel-file-using-xlsxwriter-module/

answered Oct 01 '20 at 03:22

vgeorge

96
8

As mentioned we need to extract only the specific field values from the PDF file and store the data into excel columns. – Manz Oct 01 '20 at 04:32

How to extract only specific text from PDF file using python

2 Answers2

Linked