I am trying to automate the process of extracting data from my bank statements and introduce them into excel. I am aware that there are bank API's that facilitate getting this data directly but for this use case, I cannot use them.
I am using python3 and the PyPDF2 library. (I am using python3 because it the language I am most comfortable with, but I'm willing to try anything if there is one more suitable for my use case)
The code I have up to now:
# Importing required modules
import PyPDF2
def pdf_to_excel():
# Creating a pdf file object
pdfFileObj = open('Horses Jan 2020.pdf', 'rb')
# Creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Printing number of pages in pdf file
print(pdfReader.numPages)
# Printing Document Layout
print(pdfReader.pageLayout)
# Creating a page object
pageObj = pdfReader.getPage(0)
# Extract movement details into a dict
movments = extract_movement_details(pageObj)
# Closing the pdf file object
pdfFileObj.close()
return movments
def extract_movement_details(pageObj):
# Extracting text from page
pdftext = pageObj.extractText()
# Split pdf text
text_split = pdftext.split()
""" Extract General Information from Bank Statment """
# Extract General Information
client_n = text_split[10].replace('Cliente', '').split('R', 1)
general_info = {
'From': text_split[1],
'To': text_split[3].replace('Fecha', ''),
'Account No.': text_split[8].replace('Cuenta', '').replace('No.', ''),
'Client No.': client_n[0],
'R.F.C': client_n[1].replace('.F.C', '').replace('No.', ''),
'No. Key Account': text_split[12].replace('CLABE', '').replace('PAGINA', '')
}
# Extract movments information
movement_details = {
'Date Opt': 0,
'Date Liq': 0,
'Description': 0,
'Reference': 0,
'Debit': 0,
'Credit': 0,
'Operation': 0,
'Liquidation': 0
}
print(general_info)
print(movement_details)
print("\n")
# Printing the text from the page
print(text_split)
movments = text_split
return movments
pdf_to_excel()
It uses PyPDF2 and consists of two functions, one that opens the file and one that splits the text and extracts the data I need. Basically, I am trying to recognize patterns and extract the data in accordance with those patterns. However, it is proving to be much harder than I thought it would be because the data does not seem to be organized in a logical way on the list I got. Words are stuck together, they are not in the same order as they are visually on excel, and recognizing the patterns is proving quite difficult.
Unfortunately, since it's banking information, I am not sure how I can display it here without giving away sensitive information but I will work on that and try to post the text extracted and the way the docs look. For now here is a picture with the sensitive information redacted,
Moreover, before trying with PyPDF2 I tried with tabula but since the movements are in a table with invisible borders it did not recognize them as tables.
Also, I will have to do this with many more banks in the future so if what I want can be done with a technology with a larger learning curve I am okay with that.