Using iText7 (v8.0.0), I am attempting to extract text from a (non-PDF/A) PDF. The code is as follows:
```csharp
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

// Open the document and extract the text of the first page.
using var pdfDocument = new PdfDocument(new PdfReader("TestFile.pdf"));
var page = pdfDocument.GetPage(1);
var text = PdfTextExtractor.GetTextFromPage(page);
```
The output is fine except that certain words in the extracted text contain spurious spaces. I understand that this is to do with the way text is rendered into the PDF. The GetTextFromPage method has an overload that takes a text extraction strategy; I tried both of the default strategy implementations, LocationTextExtractionStrategy and SimpleTextExtractionStrategy, but neither dealt with the issue.
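For reference, here is a minimal sketch of how the strategy overload can be called with each of the two built-in strategies. It assumes the same file and page as the code above; the variable names locationText and simpleText are only illustrative.

```csharp
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

using var pdfDocument = new PdfDocument(new PdfReader("TestFile.pdf"));
var page = pdfDocument.GetPage(1);

// A fresh strategy instance is passed to each call, since a strategy
// collects the text it sees while the page content is processed.
var locationText = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
var simpleText = PdfTextExtractor.GetTextFromPage(page, new SimpleTextExtractionStrategy());
```

Both calls return text with the same spurious spaces inside words.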
I am guessing that I need to define my own text extraction strategy but it isn't very obvious how to go about doing this.
(In case readers are interested, I tried the same with IronPDF and that was no better.)