Quick intro to what I want to achieve:
- I have a PDF with product orders from a supplier
- I want to match the product names from the order with the product names on our website, and then add the product amounts, costs and so on to our logistics (that part isn't very relevant to this question)
Bottom line: I want to read a PDF, process the data in code, and at the end convert it into a CSV.
The issue I've noticed is that the text from the order PDF comes out very poorly (I'm using iTextSharp to get the text from all pages, line by line):
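for (int i = 1; i <= reader.NumberOfPages; i++)
{
    // Extract the text of page i using the extraction strategy (Strategy is defined elsewhere)
    string page = PdfTextExtractor.GetTextFromPage(reader, i, Strategy);
    //page = page.Substring(page.IndexOf("Total") + 6);
    // Split the page text into individual lines
    string[] lines = page.Split('\n');
}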
That all works, and I can read the text line by line. The issue is that the lines I get from the PDF don't match the lines as they are displayed in the PDF. The following is an example:
- The highlighted part is the product name, which is the most relevant field. The first issue is that in the PDF it spans 2 lines, not one. That in itself wouldn't be a problem, since I use logic that appends the lower part to the upper part of the product name, but unfortunately lines are sometimes duplicated, or the order of the lines is completely off, when I get the text via iTextSharp.
I'm currently splitting each line on spaces (" ") and then determining which field is which with checks like this:
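To make the "append the lower part" logic concrete, here is a minimal sketch of that kind of cleanup. It is illustrative only: `LineCleaner` and `MergeWrappedLines` are made-up names, and the rule "a line without a '.' has no price field, so it must be the wrapped half of the previous name" is an assumption, not the actual code in use. It drops blank lines and immediately repeated lines, then glues continuation lines onto the line before them.

```csharp
using System;
using System.Collections.Generic;

public static class LineCleaner
{
    // Hypothetical heuristic (an assumption for this sketch): a line that
    // contains no '.' carries no price field, so it is treated as the
    // wrapped second half of the previous line's product name.
    public static List<string> MergeWrappedLines(IEnumerable<string> lines)
    {
        var result = new List<string>();
        string previous = "";

        foreach (string raw in lines)
        {
            string line = raw.Trim();

            // Skip blank lines and lines identical to the one just seen.
            if (line.Length == 0 || line == previous)
            {
                continue;
            }
            previous = line;

            if (result.Count > 0 && !line.Contains("."))
            {
                // Continuation: append to the last kept line.
                result[result.Count - 1] += " " + line;
            }
            else
            {
                result.Add(line);
            }
        }
        return result;
    }
}
```

This only handles duplicates that appear directly after each other, and it cannot repair lines that arrive in the wrong order, which is exactly where this approach breaks down.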
if(!item.Contains("."))
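For context, a minimal sketch of this kind of field detection could look like the following. The names `OrderLineParser` and `Parse` are hypothetical, and the assumption that every token without a '.' belongs to the product name (while tokens with a '.' are decimal values such as prices) is a simplification for illustration, not the real parsing code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class OrderLineParser
{
    // Splits one extracted line on spaces. Tokens without a '.' are
    // collected as parts of the product name; tokens with a '.' are
    // parsed as decimals when possible (assumed to be prices/amounts).
    public static (string Name, List<decimal> Numbers) Parse(string line)
    {
        var nameParts = new List<string>();
        var numbers = new List<decimal>();

        string[] items = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string item in items)
        {
            if (!item.Contains("."))
            {
                nameParts.Add(item);
            }
            else if (decimal.TryParse(item, NumberStyles.Number,
                                      CultureInfo.InvariantCulture, out decimal value))
            {
                numbers.Add(value);
            }
            else
            {
                // Contains a '.' but is not a number (e.g. "No.5"): keep it in the name.
                nameParts.Add(item);
            }
        }
        return (string.Join(" ", nameParts), numbers);
    }
}
```

Note the weakness of this check: an integer quantity such as "3" has no '.', so it ends up in the product name, and any duplicated or misplaced line from the extraction goes straight through.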
But it basically seems impossible to arrange the string I get from the PDF reliably enough to avoid mistakes, because the extracted string is so flawed.
I tried exporting the PDF to Excel, TXT, CSV, etc. with loads of different converters, but all the results seem off.
Surely there is a way to read the PDF lines the way they are displayed?