I am working on this PDF file to parse the tabular data out of it. I was hoping to use tabula
or PyPDF2
to extract tables out of it but the data in PDF is not stored in tables. So, I chose pdfplumber
to extract text out of it. Until now, I am able to read text line by line. But I can not figure out a universal pattern that I can use to extract the pricing list rows which I can store in a pandas dataframe and write to an excel file.
Can you help me if I should construct a regular expression or anything else that I can use to extract the pricing list out of this PDF? Because I can not think of any particular regular expression that would fit the messy nature of data inside the PDF, is there any better approach to take? Or simply it's not possible?
Code
Using the following code, I am able to extract all lines of text but the problem is, one price entry is spread across two rows. Consider current row is where most details about the entry are listed, how can I decide if the previous or next row also has information related to current entry.
If I could somehow figure that out, what might be the right approach to deal with the column values, they can be from 6-13 per line, how can I decide if at this particular location in current line, the column value resides?
import pdfplumber as scrapper
text = []
with scrapper.open('./report.pdf') as pdf:
for page in pdf.pages:
text.append(page.extract_text())
The PDF file I am working with: https://drive.google.com/file/d/1GtjBf9FcKJCOJVNcGA9mvAshJ6t0oFca/view?usp=sharing
Sample Pictures demonstrating which data should fit in which fields: