I am trying to parse a PDF file that has two columns of text on most pages and no images. I tried the iTextSharp solution from the question "how can i get text formatting with iTextSharp". It seemed to work at first, but then I noticed some rather serious issues: in some places the text from my PDF is returned out of order. I simply want the text parsed in the same order in which it exists on each page (no special order), but this is not happening. Is there a version of the TextWithFontExtractionStrategy solution for iText 7 that does not exhibit this problem (or, for that matter, a version for iTextSharp that works correctly)? I would appreciate any assistance.

Weequay
  • What issues? What's the code you're using? Is this a coding problem or a question about a 3rd party framework? – dirkgroten Feb 17 '20 at 19:17
  • The issues that I am having concern the iTextSharp TextWithFontExtractionStrategy code in the link that I provided. iTextSharp is a 3rd party framework that has been deprecated and replaced with iText 7. The iTextSharp code parses text into a string just fine, but then at a seemingly random point it starts to bring in text from parts of a PDF page that are not in order. For instance, it will be parsing lines in the second column and then jump to text that is earlier in the PDF and in the first column. That is just an example. – Weequay Feb 17 '20 at 19:33
  • *"the text being returned out of order on my PDF"* - what do you mean exactly? That `TextWithFontExtractionStrategy` returns the text in the very order in which it is drawn. If you want text extraction in a different order, you have to describe the order you want before others can tell whether iText 7 supports that order. – mkl Feb 17 '20 at 22:11
  • I want to return the text as it exists in the PDF. However, surprisingly, that is not happening as it should. I don't want it in any special order, just the order in which it is drawn. I would be happy to continue to use iTextSharp, but the code isn't working right. The text appears to be extracted in the way it exists in the PDF, but then the code pulls in text that is located somewhere else on the page. This is clearly not right. I suspect that the problem is caused by the fact that the PDF has two columns. – Weequay Feb 17 '20 at 23:21
  • *"I don't want it in any special order, just the order in which it is drawn"* - As mentioned above, that `TextWithFontExtractionStrategy` does return the text in the very order in which it is drawn. That order may indeed jump here and there all the time and doesn't need to respect columns or anything. As you call that *"clearly not right"*, you do want a special order after all. It sounds like you want something like the order in which you would probably read the text. PDFs may contain hints that can be used to extract in such an order, but such hints are not mandatory. – mkl Feb 18 '20 at 05:38
  • If you shared representative examples of files for which you observe those issues, we could analyze them and check for the existence of such hints. If they are there, we can describe how to build your own extraction strategies that respect those hints, either for iText 5 or 7. – mkl Feb 18 '20 at 05:43
  • Thank you for your assistance and insight. Unfortunately, I do not have any PDFs that I could provide, as they are proprietary. I have been looking for other PDFs to send instead, but have not found any that exhibit the problems that I am experiencing. These PDFs parse the text as one would expect, in the order that it is displayed and read from one column to the next. – Weequay Feb 18 '20 at 21:10
  • With all that being said, is there an iText 7 version of the TextWithFontExtractionStrategy solution from the forum thread that I referenced initially? Perhaps it will not exhibit the issues with my PDFs. – Weequay Feb 18 '20 at 21:17
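
To illustrate the distinction drawn in the comments between draw order and reading order, here is a minimal sketch in Python. It does not use iText at all: the `Chunk` class, the `reading_order` function, and the sample coordinates are hypothetical, and the sketch assumes that each text chunk's position is already known and that the two columns are separated at the horizontal middle of the page. Real files may need a different column boundary or may carry structure hints that make such guessing unnecessary.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A piece of text with its position (hypothetical helper type)."""
    text: str
    x: float  # left edge of the chunk in page units
    y: float  # baseline height; in PDF coordinates y grows upward


def reading_order(chunks, page_width):
    """Sort chunks: left column top to bottom, then right column.

    Assumption: the columns are split at the middle of the page.
    """
    split_x = page_width / 2

    def key(chunk):
        column = 0 if chunk.x < split_x else 1
        # Negate y so that higher positions (larger y) come first.
        return (column, -chunk.y, chunk.x)

    return sorted(chunks, key=key)


# Chunks listed in the order a PDF might draw them: line by line
# across both columns, which is valid but not the order a reader follows.
draw_order = [
    Chunk("L1", 50, 700),
    Chunk("R1", 320, 700),
    Chunk("L2", 50, 680),
    Chunk("R2", 320, 680),
]

ordered = reading_order(draw_order, page_width=612)
print(" ".join(c.text for c in ordered))  # prints "L1 L2 R1 R2"
```

An extraction strategy that returns text in draw order would produce `L1 R1 L2 R2` for this made-up page; getting `L1 L2 R1 R2` requires an explicit choice of order such as the sort above.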

0 Answers