Is there a better way to extract data from a pdf?

Question

I am trying to automate the process of extracting data from my bank statements and introduce them into excel. I am aware that there are bank API's that facilitate getting this data directly but for this use case, I cannot use them.

I am using python3 and the PyPDF2 library. (I am using python3 because it the language I am most comfortable with, but I'm willing to try anything if there is one more suitable for my use case)

The code I have up to now:

# Importing required modules
import PyPDF2

def pdf_to_excel():
    # Creating a pdf file object
    pdfFileObj = open('Horses Jan 2020.pdf', 'rb')
    
    # Creating a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    
    # Printing number of pages in pdf file
    print(pdfReader.numPages)

    # Printing Document Layout
    print(pdfReader.pageLayout)
    
    # Creating a page object
    pageObj = pdfReader.getPage(0)

    # Extract movement details into a dict
    movments = extract_movement_details(pageObj)

    
    # Closing the pdf file object
    pdfFileObj.close()
    
    return movments

def extract_movement_details(pageObj):
    
    # Extracting text from page
    pdftext = pageObj.extractText()

    # Split pdf text
    text_split = pdftext.split()

    """ Extract General Information from Bank Statment """
    # Extract General Information
    client_n = text_split[10].replace('Cliente', '').split('R', 1)
    general_info = {
        'From': text_split[1],
        'To': text_split[3].replace('Fecha', ''),
        'Account No.': text_split[8].replace('Cuenta', '').replace('No.', ''),
        'Client No.': client_n[0],
        'R.F.C': client_n[1].replace('.F.C', '').replace('No.', ''),
        'No. Key Account': text_split[12].replace('CLABE', '').replace('PAGINA', '')
        }
    
    # Extract movments information

    movement_details = {
        'Date Opt': 0,
        'Date Liq': 0,
        'Description': 0,
        'Reference': 0,
        'Debit': 0,
        'Credit': 0,
        'Operation': 0,
        'Liquidation': 0
        }

    print(general_info)
    print(movement_details)
    print("\n")
    
    # Printing the text from the page
    print(text_split)
    
    movments = text_split
    return movments


pdf_to_excel()

It uses PyPDF2 and consists of two functions, one that opens the file and one that splits the text and extracts the data I need. Basically, I am trying to recognize patterns and extract the data in accordance with those patterns. However, it is proving to be much harder than I thought it would be because the data does not seem to be organized in a logical way on the list I got. Words are stuck together, they are not in the same order as they are visually on excel, and recognizing the patterns is proving quite difficult.

Unfortunately, since it's banking information, I am not sure how I can display it here without giving away sensitive information but I will work on that and try to post the text extracted and the way the docs look. For now here is a picture with the sensitive information redacted,

Moreover, before trying with PyPDF2 I tried with tabula but since the movements are in a table with invisible borders it did not recognize them as tables.

Also, I will have to do this with many more banks in the future so if what I want can be done with a technology with a larger learning curve I am okay with that.

Did you try Camelot also? https://pypi.org/project/camelot-py/ — K.Cl, Apr 02 '21 at 22:20
Nop, I have not heard of it. Will look now and get back to you — Chris, Apr 02 '21 at 22:20
I will give it a try now, however as I can see the table they have on their example has clearly defined borders. And I got 13 pages of tables with invisible borders. such as the one in the second half of the image. So I don't think it will work. @K.CI — Chris, Apr 02 '21 at 22:24
I just want to caution you, PDFs are designed to print beautifully and consistently. They are not designed to allow data to be extracted. Characters that appear together on the page are not necessarily adjacent to each other in the PDF code. It's always going to be a bit of a struggle. This is why the APIs are so important. — Tim Roberts, Apr 02 '21 at 22:25
Thank you @TimRoberts I have become aware of that. But it seems to be what I need to do. And even tho they are designed to be beautiful they still follow the same pattern for every month and every person. So if I can extract from one Statement for this bank i will be able to extract for each statement for any person. Right? — Chris, Apr 02 '21 at 22:29
I don't know how much hacking you want to do, but you can use `pdftk` to uncompress the PDF content and look at the result in a text editor. You MIGHT be able to find the strings you want in a consistent way. — Tim Roberts, Apr 02 '21 at 22:31
Maybe [`pdfminer`](https://pypi.org/project/pdfminer/) can help you out? [here's an example of how to use pdfminer](https://stackoverflow.com/a/26495057/6934388) And have you considered [OCR](https://pypi.org/project/pytesseract/)? — Queuebee, Apr 02 '21 at 22:36
I will give that a try but I don't think it will give me any different result than what I have now. As I currently end up with a large strick with all the text on the pdf page and start doing my string manipulation. Do you think the resulting string from ""pdftk"" will be better organized than the one from """PyPDF2"""? @TimRoberts — Chris, Apr 02 '21 at 22:39
Same question @Queuebee you think the string extracted with pdfminer will be any different or better than with PyPDF2? — Chris, Apr 02 '21 at 22:43
Well, `pdftk` would just uncompress the Postscript code that makes up the PDF. Almost every PDF is created with compression, because the Postscript is quite wordy. I guess I'm not really sure that's a valuable thread for you to follow. — Tim Roberts, Apr 02 '21 at 22:43
@Queuebee well, I considered OCR but for the bank statements, I got in as a picture, and also for invoices that come as pictures. Again, I'm not sure it will provide different results than just the text extraction with PyPDF2 — Chris, Apr 02 '21 at 22:47
@TimRoberts I kind of think the same, I'm not sure it gets me any closer to my goal of having software that transforms pdf bank statements into excel and well other formats but where I can manipulate the data. — Chris, Apr 02 '21 at 22:50
@ChristopherHolder [this post](https://stackoverflow.com/a/61857301/6934388) says pdfminer works more reliably than pypdf2. I have had consistent results myself, so unless you provide the format of the output in contrast with the input pdf I can't judge that. — Queuebee, Apr 02 '21 at 22:50
@Queuebee Since it is sensitive data and if I obfuscate it there will be no useful information, I can't do that. But that you, I will try it right now and let you know. — Chris, Apr 02 '21 at 22:53
@Queuebee Yes, I am getting considerably better results from pdfminer than from pypdf2. Thanks — Chris, Apr 02 '21 at 23:25
Issue with that is the nature of PDFs. Sometimes values get split into multiple values, specially for text where it may recognize two lines of text as two values. However the apreach you speak of worked for me, but requires a lot more code and only worked with a good understanding of the file and the way things are possioned on it. So it does not an a new bank stamens from a different bank for example. I think it’s more about you use case — Chris, Aug 08 '22 at 03:27
PyPDF2 had a lot of text extraction improvements in June 2022. Please make sure you're using the latest PyPDF2 version. If you still think pdfminer gives better results, I would be very interested in what "better" means for you (I'm the maintainer of PyPDF2 since April 2022). If you could upload an example PDF document it would be very helpful. — Martin Thoma, Aug 08 '22 at 09:54

Is there a better way to extract data from a pdf?

0 Answers0