How can I extract text from a scanned PDF document with C#?

Question

I am developing a small app with C# and .NET to automate a process that is currently done manually. The app looks for a particular pattern in a PDF document and uploads the document wherever it needs to go according to that pattern. It works without any issues for PDFs that were created digitally (Word, Notepad, etc.) and then converted to PDF.

I later found out that about 90% of the documents used in the future will be scanned. This turned out to be a much larger issue than I expected. I found multiple third-party libraries that can handle this task: iText7, LeadTools, ABBYY, WhatsMate PDF-to-text API, SautinSoft .NET Office Edition. The issue is that they are all paid and I cannot afford any of them.

I got an idea to convert the PDF to any image type (jpg, png, tiff, etc.) and use Tesseract OCR to recognize the text. The issue is, I cannot find a free-to-use library to convert to image type.

I am asking for any advice on the topic. Is it possible to extract text from a scanned PDF for free? Or is it possible to convert the PDF to an image type and use OCR for free?

Thank you for your time and I apologize if I did not format my question the right way.

As far as I know, ImageMagick can convert PDF to images, and it's free. There is also a managed wrapper called Magick.NET. You can see some samples here: https://stackoverflow.com/questions/2916555/converting-pdf-to-images-using-imagemagick-net-how-to-set-the-dpi — Serg, Jan 05 '21 at 08:31
You can easily read PDFs in your C# program with this library: https://www.nuget.org/packages/iTextSharp/. I have used it multiple times in different projects. — Syed Muhammad Munis Ali, Jan 05 '21 at 08:41
@Beso you [highlighted everything](https://stackoverflow.com/review/suggested-edits/28003660) that was a slightly programming-related term (C#, .NET, PDF). This does not increase readability. See for example [Inline Code Spans should not be used for emphasis, right?](https://meta.stackexchange.com/questions/135112/inline-code-spans-should-not-be-used-for-emphasis-right?noredirect=1&lq=1). — CodeCaster, Jan 05 '21 at 09:16
*"I got an idea to convert the PDF to any image type"* - depending on the nature of the scanned pages you don't need to *convert* to bitmap, it suffices to *extract* the scanned bitmap embedded in the PDF page. — mkl, Jan 05 '21 at 09:18
@Beso no, you don't have to do anything. The text is perfectly readable as is, there is no need to put emphasis on random keywords throughout the text. — CodeCaster, Jan 05 '21 at 09:28

score 0 · Answer 1 · answered Nov 18 '22 at 10:11

A free solution is Nicomsoft OCR; its website is easy to find online. Here is part of the code: load the scanned PDF, enable OCR, and save the result (DOCX or HTML):

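    // Hypothetical wrapper: the original snippet omitted the method header and the declaration of f.
    public static void ConvertScannedPdf(string pdfPath)
    {
        // f is assumed to be a SautinSoft.PdfFocus instance (the PdfFocus type is referenced in WordOptions below).
        SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();
        string pdfFile = pdfPath;
        string outFile = String.Empty;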

        f.OpenPdf(pdfFile);
        if (f.PageCount > 0)
        {
            // To Docx.
            outFile = "Result.docx";
            f.WordOptions.Format = PdfFocus.CWordOptions.eWordDocument.Docx;
            if (f.ToWord(outFile) == 0)
                System.Diagnostics.Process.Start(new System.Diagnostics.ProcessStartInfo(outFile) { UseShellExecute = true });

            // To HTML.
            outFile = "Result.html";
            f.HtmlOptions.KeepCharScaleAndSpacing = false;
            if (f.ToHtml(outFile) == 0)
                System.Diagnostics.Process.Start(new System.Diagnostics.ProcessStartInfo(outFile) { UseShellExecute = true });
        }
        else
        {
            Console.WriteLine("Error: {0}!", f.Exception.Message);
            Console.ReadLine();
        }
    }
    public static byte[] PerformOCRNicomsoft(byte[] image)
    {
        NSOCRLib.NSOCRClass NsOCR;
        int CfgObj = 0;
        int OcrObj = 0;
        int ImgObj = 0;
        int SvrObj = 0;

        NsOCR = new NSOCRLib.NSOCRClass();
        NsOCR.Engine_SetLicenseKey("AB2A4DD5FF2A"); //required for licensed version only
        NsOCR.Engine_InitializeAdvanced(out CfgObj, out OcrObj, out ImgObj);

        // Scale
        NsOCR.Cfg_SetOption(CfgObj, TNSOCR.BT_DEFAULT, "ImgAlizer/AutoScale", "0");
        NsOCR.Cfg_SetOption(CfgObj, TNSOCR.BT_DEFAULT, "ImgAlizer/ScaleFactor", "4.0");

        NsOCR.Cfg_SetOption(CfgObj, TNSOCR.BT_DEFAULT, "Languages/English", "1");

        try
        {
            int res = 0;

            // Img_LoadFromMemory takes the buffer as a System.Array passed by ref.
            // The byte[] can be assigned directly; no MemoryStream round trip is needed.
            Array imgArray = image;
            res = NsOCR.Img_LoadFromMemory(ImgObj, ref imgArray, imgArray.Length);
            if (res > TNSOCR.ERROR_FIRST)
                return null;

            NsOCR.Svr_Create(CfgObj, TNSOCR.SVR_FORMAT_PDF, out SvrObj);
            NsOCR.Svr_NewDocument(SvrObj);

            // Run every OCR step, from first to last, on the loaded image.
            res = NsOCR.Img_OCR(ImgObj, TNSOCR.OCRSTEP_FIRST, TNSOCR.OCRSTEP_LAST, TNSOCR.OCRFLAG_NONE);
            if (res > TNSOCR.ERROR_FIRST)
                return null;

            res = NsOCR.Svr_AddPage(SvrObj, ImgObj, TNSOCR.FMT_EXACTCOPY);
            if (res > TNSOCR.ERROR_FIRST) return null;

            Array outPdf = null;
            NsOCR.Svr_SaveToMemory(SvrObj, out outPdf);

            return (byte[])outPdf;
        }
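        finally
        {
            // Release the native engine so repeated calls do not leak resources.
            // Assumption: Engine_Uninitialize is the cleanup call in your NSOCR SDK version; verify before relying on it.
            NsOCR.Engine_Uninitialize();
        }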
    }
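
For comparison, the route raised in the question and in the comments (render each PDF page to an image with Magick.NET, then run Tesseract on it) could be sketched roughly as below. This is only a sketch under assumptions that are not in the original post: the `Magick.NET-Q8-AnyCPU` and `Tesseract` NuGet packages are installed, Ghostscript is available on the machine (ImageMagick needs it to read PDFs), and a `tessdata` folder containing `eng.traineddata` exists. The class name `ScannedPdfOcr`, the method name `ExtractText`, and the 300 DPI setting are hypothetical choices for illustration.

```csharp
using System;
using System.Text;
using ImageMagick;   // NuGet: Magick.NET-Q8-AnyCPU (assumed; needs Ghostscript for PDF input)
using Tesseract;     // NuGet: Tesseract (assumed; needs a tessdata folder)

public static class ScannedPdfOcr
{
    // Hypothetical helper: renders every page of a PDF to an in-memory PNG
    // and returns the text Tesseract recognizes on all pages.
    public static string ExtractText(string pdfPath, string tessDataPath)
    {
        // 300 DPI is an assumed starting point for OCR; tune it for your scans.
        var settings = new MagickReadSettings { Density = new Density(300, 300) };
        var text = new StringBuilder();

        using (var pages = new MagickImageCollection())
        using (var engine = new TesseractEngine(tessDataPath, "eng", EngineMode.Default))
        {
            // Reading a PDF yields one image per page.
            pages.Read(pdfPath, settings);

            foreach (var page in pages)
            {
                // Encode the rendered page as PNG bytes.
                page.Format = MagickFormat.Png;
                byte[] png = page.ToByteArray();

                // Hand the PNG to Tesseract and collect the recognized text.
                using (var pix = Pix.LoadFromMemory(png))
                using (var result = engine.Process(pix))
                {
                    text.AppendLine(result.GetText());
                }
            }
        }

        return text.ToString();
    }
}
```

A call such as `string text = ScannedPdfOcr.ExtractText("scan.pdf", "./tessdata");` would then give a string that the existing pattern search can run against. As mkl points out in the comments, if each page is just one embedded scan, extracting that bitmap directly may avoid the rendering step entirely.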
