
Quick intro to what I want to achieve:

  • I have a PDF with product orders from a supplier
  • I want to match the product names from the order with the product names on our website, and then add the product quantities, costs and so on to our logistics system (that part isn't very relevant to this question)

The bottom line is that I want to read a PDF, process the data in code and at the end write it out as a CSV.

The issue I've noticed is that the order PDF is generated very poorly. I'm using iTextSharp to get the text from all pages line by line:

    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        // Strategy is an ITextExtractionStrategy (e.g. a LocationTextExtractionStrategy)
        string page = PdfTextExtractor.GetTextFromPage(reader, i, Strategy);
        //page = page.Substring(page.IndexOf("Total") + 6);

        string[] lines = page.Split('\n');
    }
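
One thing worth checking (this is an assumption about how Strategy is set up, not something visible in the snippet): if a single extraction-strategy instance is reused for every page, iTextSharp keeps accumulating text in it across GetTextFromPage calls, which can show up as duplicated lines. A minimal sketch that creates a fresh strategy per page:

    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    using (var reader = new PdfReader("order.pdf"))  // hypothetical file name
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // A new strategy per page, so text from earlier pages
            // doesn't leak into the current page's result.
            var strategy = new LocationTextExtractionStrategy();
            string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
            string[] lines = page.Split('\n');
        }
    }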

Now, all of that works and I can get text line by line. The issue is that the lines I get from the PDF don't actually match the lines as they are displayed in the PDF. The following is an example:

[Screenshot of the extracted order table; the highlighted product name spans two lines.]

  • The highlighted text is the product name, which is the most relevant part. That's where the first issue appears: in the PDF that product name is two lines, not one. That on its own wouldn't be a problem, since I'm using logic that appends the lower line to the upper product name (see the sketch just below), but unfortunately lines are sometimes duplicated, or their order is completely off when I get the text via iTextSharp
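
For reference, a minimal sketch of the kind of merge logic I mean. The heuristic here, that a real order row starts with a digit and anything else is a wrapped continuation of the previous product name, is just an assumption for illustration; the actual rule depends on the order layout:

    // lines comes from page.Split('\n') above; needs using System.Collections.Generic;
    var merged = new List<string>();
    foreach (string line in lines)
    {
        string trimmed = line.Trim();
        if (trimmed.Length == 0)
            continue;

        // Assumption: real rows start with a digit (position/article number),
        // so anything else is treated as the wrapped second half of the
        // previous product name and appended to it.
        if (merged.Count > 0 && !char.IsDigit(trimmed[0]))
            merged[merged.Count - 1] += " " + trimmed;
        else
            merged.Add(trimmed);
    }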

I'm currently splitting the lines by " " (spaces) and then determining which field is which with checks like this:

if(!item.Contains("."))
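
To make that concrete, here is roughly what the splitting and field detection looks like for one merged row. The column rules, e.g. that a token containing "." is a decimal amount and a purely numeric token is an article number or quantity, are assumptions for illustration only:

    // row is one merged line, e.g. "12345 Widget Pro 2000 4 19.90 79.60" (made-up example)
    // needs using System.Collections.Generic; and using System.Globalization;
    string[] fields = row.Split(' ');

    var nameParts = new List<string>();
    var amounts = new List<decimal>();

    foreach (string item in fields)
    {
        if (item.Contains("."))
            // Tokens with a "." are treated as prices/amounts.
            amounts.Add(decimal.Parse(item, CultureInfo.InvariantCulture));
        else if (!int.TryParse(item, out _))
            // Non-numeric tokens are treated as part of the product name.
            nameParts.Add(item);
    }

    string productName = string.Join(" ", nameParts);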

But it basically seems impossible to arrange the string I get from the PDF into something error-free, because the extracted text itself is so flawed.

I tried exporting the PDF to Excel, txt, CSV etc. with loads of different converters, but all the results seem off.

Surely there is a way to just read the PDF lines the way they are displayed?

  • This looks like it may be using a table in the pdf, if so, you can look at this related answer: [itextsharp-how-to-read-table-in-pdf-file](https://stackoverflow.com/questions/40929677/itextsharp-how-to-read-table-in-pdf-file) – Ryan Wilson Oct 31 '22 at 13:23
  • @RyanWilson thanks! I tried the steps suggested there, changing the strategy; the string I get looks better but still has some flaws that are hard to get rid of. – Mischa Morf Oct 31 '22 at 13:43
  • Are the PDFs you receive from the same source, created the same way? In that case can you share an example for analysis? – mkl Oct 31 '22 at 15:02
  • PDFs are for printing, not reading. A PDF file contains print commands, not markup. It doesn't even support tables. What you see is black lines and text that look like a table when they're rendered. Libraries that read PDF data like [Tabula](https://github.com/tabulapdf/tabula-java) *guess* tables, footers and rows in the PDF using heuristics and sometimes even OCR. That's why the good ones aren't free. Tabula even allows you to specify the range in each page to analyze (to exclude headers and footers) and column coordinates – Panagiotis Kanavos Nov 01 '22 at 07:58
  • There's no guarantee that the text you see flows *horizontally*. An application could generate print commands one column at a time. That's why row/column selection in PDFs seems to behave so strangely. – Panagiotis Kanavos Nov 01 '22 at 08:01
  • Thanks for all the answers. I've managed to get pretty decent results by using the SimpleTextExtractionStrategy() from iTextSharp. It's nowhere near perfect and requires a ton of code to format, but it works better than the location strategy, which just creates so many issues – Mischa Morf Nov 01 '22 at 12:59
