I created a simple method that extracts text from a PDF file and writes it to a text file. The issue is that it only extracts the PDF's own text, not the text contained in images embedded in the PDF. I tried this link but did not understand how to implement it. The code below works fine if you are only interested in text.
//usings
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System.IO;
using System.Text;
//code
string file = @"C:\test.pdf";
string extension = Path.GetExtension(file);
var pageText = new StringBuilder();
if (extension == ".pdf")
{
    using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(file)))
    using (StreamWriter sw = new StreamWriter(@"C:\output.txt"))
    {
        int pageCount = pdfDocument.GetNumberOfPages();
        for (int i = 1; i <= pageCount; i++)
        {
            // Use a fresh strategy per page so text from one page does not leak into the next.
            LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
            // Process page i (pages are 1-based); GetFirstPage() would re-read page 1 on every pass.
            parser.ProcessPageContent(pdfDocument.GetPage(i));
            pageText.AppendLine(strategy.GetResultantText());
        }
        // Write the accumulated text once, after all pages have been processed.
        sw.Write(pageText.ToString());
    }
}
I feel the issue is very simple but I can't figure it out.