How to extract data from a messy PDF file with no standard formatting?

Question

I am trying to parse the tabular data out of this PDF file. I had hoped to use tabula or PyPDF2 to extract the tables, but the data in the PDF is not stored as tables, so I chose pdfplumber to extract the text instead. So far I can read the text line by line, but I cannot find a universal pattern for extracting the pricing-list rows so that I can store them in a pandas DataFrame and write them to an Excel file.

Should I construct a regular expression, or is there something else I can use to extract the pricing list from this PDF? I cannot think of a regular expression that would fit the messy data in the PDF. Is there a better approach, or is it simply not possible?

Code

With the following code I can extract all lines of text, but the problem is that one price entry is spread across two rows. If the current row is where most of the details of an entry are listed, how can I decide whether the previous or the next row also holds information belonging to that entry?

If I could figure that out, what would be the right approach for the column values? There can be anywhere from 6 to 13 per line, so how can I decide which column a value at a particular position in the current line belongs to?

import pdfplumber as scrapper

# Collect the extracted text of every page, one string per page
text = []
with scrapper.open('./report.pdf') as pdf:
    for page in pdf.pages:
        text.append(page.extract_text())

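One possible way to deal with entries that span two lines is sketched below. It rests on assumptions that would need checking against the actual PDF: that every main row contains at least one price-like number, and that a line without a price continues the entry above it. The function name `merge_continuation_lines`, the pattern `PRICE_RE`, and the sample lines are made up for illustration and are not taken from the report.

```python
import re

# Assumption: every main row contains at least one price such as "12.50" or "1,234.00".
PRICE_RE = re.compile(r"\b\d{1,3}(?:,\d{3})*\.\d{2}\b")


def merge_continuation_lines(lines):
    """Group lines into entries; a line without a price joins the previous entry."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if PRICE_RE.search(line):
            # A priced line starts a new entry.
            entries.append([line])
        elif entries:
            # Assumed to be a continuation of the entry above it.
            entries[-1].append(line)
        # Lines before the first priced line (titles, headers) are skipped.
    return [" ".join(parts) for parts in entries]


# Made-up sample lines, only to show the behaviour.
sample = [
    "PRICE LIST",
    "A100 Widget large 12.50 14.00",
    "stainless steel",
    "",
    "B200 Gasket 1,234.00",
]
merged = merge_continuation_lines(sample)
```

With the `text` list from the code above, the input could be built as `[line for page_text in text for line in (page_text or "").splitlines()]`. Where the continuation sits above the main row instead of below it, text alone is probably not enough to decide; the vertical gap between lines (the `top` value pdfplumber reports for each word) may be a better signal.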
The PDF file I am working with: https://drive.google.com/file/d/1GtjBf9FcKJCOJVNcGA9mvAshJ6t0oFca/view?usp=sharing

Sample Pictures demonstrating which data should fit in which fields:

I've heard good things about the [Camelot](https://camelot-py.readthedocs.io/en/master/) library. — AKX, Dec 14 '21 at 13:21
Thanks for the suggestion @AKX, I did try Camelot too but it doesn't extract the tabular data because it is in text format. — Aamir Khan Maarofi, Dec 14 '21 at 14:17
*"The PDF file I am working with:"* - apparently you have disallowed download of the file... — mkl, Dec 14 '21 at 15:08
Dear @mkl, the file can be downloaded now, I changed the permissions. Please have a look — Aamir Khan Maarofi, Dec 14 '21 at 15:25
I would try to implement that in multiple steps. The first one would be to read the data from the document as is into a tabular structure, including columns for the article groups and subgroups, and also for the table headings, which differ a bit across the document. As soon as I'd have the data in that format, I'd start interpreting the data into additional columns as you want them eventually. (I hardly know any Python at all, though, so I cannot tell how to do this using Python; I'd have some ideas using Java.) — mkl, Dec 14 '21 at 18:02
Thank you for commenting @mkl, I really appreciate that. I thought the same and I am able to extract the data line by line. But I cannot figure out how to deal with one price entry spread across multiple rows. My script considers each line a new entry because the data in the PDF is not clean (the product name is listed on the current and previous line, or on the current and next line, or one product has multiple colors, as on pages 13 and 14). How can I get all these different formats into one tabular format? Should I write different rules for them in the same script, each handling a different type of data? — Aamir Khan Maarofi, Dec 14 '21 at 18:10
In the first pass I'd put those groups and subgroups into extra columns having recognized them by their font size. — mkl, Dec 14 '21 at 18:23
@mkl, I did try to extract the text based on font sizes, but it turned out that the PDF uses variable font sizes for the different headings (column headers). Also, the table headings have the same font size as the description text, so in that case the script extracts both types of text. You have been very helpful throughout and I really appreciate it. Is there anything else I could try? Perhaps another approach? — Aamir Khan Maarofi, Dec 15 '21 at 18:19
Well, probably a combination of font size range and approximate position on page? — mkl, Dec 15 '21 at 21:11
@Aamir Maarofi: Try SLICEmyPDF in 1 of the answers at https://stackoverflow.com/questions/56017702/how-to-extract-table-from-pdf-in-python/72414309#72414309 — 123456, Jun 02 '22 at 09:26
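
The combination of font size range and approximate position suggested in the comments above could look roughly like the sketch below. It works on word dictionaries with the keys `text`, `x0`, `top`, and `size`, which is the shape pdfplumber returns from `page.extract_words(extra_attrs=["size"])`. The function name `words_to_rows`, the body font size range, the column boundaries, and the sample words are all hypothetical; real values would have to be measured from the PDF.

```python
from bisect import bisect_right
from collections import defaultdict


def words_to_rows(words, boundaries, body_size=(7.0, 9.5), row_tolerance=2.0):
    """Keep body-sized words and place them into columns by their x0 position.

    boundaries: sorted left edges of the columns (assumed, measured by hand).
    body_size:  inclusive font size range treated as table body text.
    """
    def row_key(word):
        # Words whose `top` falls in the same bucket are treated as one row.
        return round(word["top"] / row_tolerance)

    body = [w for w in words if body_size[0] <= w["size"] <= body_size[1]]
    rows = defaultdict(lambda: [""] * len(boundaries))
    for w in sorted(body, key=lambda w: (row_key(w), w["x0"])):
        # Index of the last column whose left edge is <= x0 (clamped to 0).
        col = max(bisect_right(boundaries, w["x0"]) - 1, 0)
        cells = rows[row_key(w)]
        cells[col] = (cells[col] + " " + w["text"]).strip()
    return [rows[key] for key in sorted(rows)]


# Made-up words: one large heading and one body row with slightly uneven `top`.
sample_words = [
    {"text": "Fittings", "x0": 40, "top": 50.0, "size": 14.0},
    {"text": "12.50", "x0": 400, "top": 100.3, "size": 8.0},
    {"text": "A100", "x0": 40, "top": 100.2, "size": 8.0},
    {"text": "large", "x0": 160, "top": 100.1, "size": 8.0},
    {"text": "Widget", "x0": 120, "top": 100.0, "size": 8.0},
]
rows = words_to_rows(sample_words, boundaries=[30, 110, 390])
```

Bucketing `top` by a fixed tolerance is a simplification: two words of the same visual line can land in neighbouring buckets if they straddle a bucket edge, so clustering on the gap between consecutive `top` values may be more robust. Since the column headers reportedly vary in font size across the document, the size range may also need to be chosen per page or per section.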

