I created a simple method that extracts text from a PDF file and writes that text to a txt file. The issue is that it only extracts the PDF's own text, not the text from the images that are embedded in the PDF. I tried this link but did not understand how to implement it. This code works fine if you are only interested in text.

//usings
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System.IO;
//code
string file = @"C:\test.pdf";
string extension = Path.GetExtension(file);

if (extension == ".pdf")
{
    // Both the document and the writer are disposed even if extraction throws
    using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(file)))
    using (StreamWriter sw = new StreamWriter(@"C:\output.txt"))
    {
        int pageCount = pdfDocument.GetNumberOfPages();

        for (int i = 1; i <= pageCount; i++)
        {
            // A fresh strategy per page, so each page's text is written exactly once
            LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
            // Process page i (page numbers are 1-based)
            parser.ProcessPageContent(pdfDocument.GetPage(i));
            sw.WriteLine(strategy.GetResultantText());
        }
    }
}

I feel the issue is very simple but I can't figure it out.

  • *"I tried this [link](https://itextpdf.com/en/blog/technical-notes/how-use-itext-pdfocr-recognize-text-scanned-documents#:%7E:text=iText%20pdfOCR%20accepts%20input%20from,text%20you%20need%20to%20access.) but did not understand how to implement."* - probably you should explain what exactly you did not understand. Because OCR essentially is what you'll have to do. – mkl Jun 02 '21 at 16:49
  • @mkl to start at the top using iText.Pdfocr; and using iText.Pdfocr.Tesseract4; can not be found. – beNiceWeAlLearning Jun 03 '21 at 12:44
  • @KJ thank you for the info but I am still lost. – beNiceWeAlLearning Jun 03 '21 at 12:44

1 Answer

First of all, let me explain why your approach doesn't work: when processing page content via PdfCanvasProcessor#ProcessPageContent with a text extraction strategy, iText collects the text drawn in the pages' content streams, not whatever is depicted inside the image XObjects that may be referenced there.

So the question is: how do you OCR such images? This question, however, can be split into two:

  1. How to find/extract all the document's images?
  2. How to OCR them?

Taking these in turn:

  1. There are several iText examples on the web that show how this can be achieved. One option is described in one of iText's Java samples: https://github.com/itext/i7js-book/blob/develop/src/main/java/com/itextpdf/samples/book/part4/chapter15/Listing_15_30_ExtractImages.java There are also several SO answers that you might want to check, for example this one: How to extract images from a PDF with iText in the correct order?

  2. Several open source libraries can be used for this task, for example iText's pdfOCR. It can either OCR an image and wrap the result in a PDF (or PDF/A), or just OCR the image. A good starting point: https://github.com/itext/i7j-pdfocr/blob/develop/pdfocr-api/src/test/java/com/itextpdf/pdfocr/ApiTest.java
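
The first step (collecting the images) could be sketched in C# roughly as follows. This is a minimal sketch rather than a port of the linked Java sample: ImageCollector, ImageExtractor and the output file naming are hypothetical names chosen for illustration, and it assumes every image is in a format for which iText can identify a file extension and return the bytes (for unsupported formats these calls can throw).

```csharp
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using iText.Kernel.Pdf.Xobject;

// Hypothetical helper: remembers every image drawn while a page is processed.
public class ImageCollector : IEventListener
{
    public List<PdfImageXObject> Images { get; } = new List<PdfImageXObject>();

    public void EventOccurred(IEventData data, EventType type)
    {
        // Only image render events carry an ImageRenderInfo
        if (data is ImageRenderInfo info && info.GetImage() != null)
        {
            Images.Add(info.GetImage());
        }
    }

    public ICollection<EventType> GetSupportedEvents()
    {
        // Ask the processor to report image rendering only
        return new HashSet<EventType> { EventType.RENDER_IMAGE };
    }
}

public static class ImageExtractor
{
    // Writes each image of each page to outputDir and returns the written paths.
    public static List<string> ExtractImages(string pdfPath, string outputDir)
    {
        var written = new List<string>();
        Directory.CreateDirectory(outputDir);

        using (var pdf = new PdfDocument(new PdfReader(pdfPath)))
        {
            for (int page = 1; page <= pdf.GetNumberOfPages(); page++)
            {
                var collector = new ImageCollector();
                new PdfCanvasProcessor(collector).ProcessPageContent(pdf.GetPage(page));

                for (int n = 0; n < collector.Images.Count; n++)
                {
                    PdfImageXObject image = collector.Images[n];
                    // Assumption: the image type is one iText can name and decode
                    string name = $"page{page}_img{n}.{image.IdentifyImageFileExtension()}";
                    string path = Path.Combine(outputDir, name);
                    File.WriteAllBytes(path, image.GetImageBytes());
                    written.Add(path);
                }
            }
        }
        return written;
    }
}
```

A call such as ImageExtractor.ExtractImages(@"C:\test.pdf", @"C:\images") would then leave one file per image on disk, ready to be passed to an OCR engine.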

A hint on your issue with the pdfOCR classes not being found: perhaps you missed the fact that pdfOCR is a separate library, so a dependency on iText Core alone is not enough; you need to add a dependency on pdfOCR itself.
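
Once the pdfOCR package is referenced, the iText.Pdfocr and iText.Pdfocr.Tesseract4 namespaces should resolve, and the OCR step might look like the sketch below. Several things here are assumptions to verify against the version you install: that the NuGet package is itext7.pdfocr.tesseract4, that the Tesseract language data (tessdata) is available in a local folder, and that the engine exposes SetPathToTessData and CreateTxtFile with these signatures, as I recall them from the pdfOCR documentation. ImageOcr and OcrImagesToTextFile are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using iText.Pdfocr.Tesseract4;

public static class ImageOcr
{
    // Runs OCR over the given image files and writes the recognized text to txtPath.
    // tessDataDir is assumed to contain the Tesseract language files (e.g. eng.traineddata).
    public static void OcrImagesToTextFile(
        IEnumerable<string> imagePaths, string tessDataDir, string txtPath)
    {
        var properties = new Tesseract4OcrEngineProperties();
        properties.SetPathToTessData(new FileInfo(tessDataDir));

        var engine = new Tesseract4LibOcrEngine(properties);

        // pdfOCR works with FileInfo objects rather than plain path strings
        IList<FileInfo> images = new List<FileInfo>();
        foreach (string path in imagePaths)
        {
            images.Add(new FileInfo(path));
        }

        // Assumption: CreateTxtFile OCRs every image and writes the combined text
        engine.CreateTxtFile(images, new FileInfo(txtPath));
    }
}
```

The text produced this way comes only from the images, so it would still need to be combined with the text that LocationTextExtractionStrategy extracts from the page itself.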

Uladzimir Asipchuk