60

I've been searching for a while, and all I've found are requests for OCR libraries. I would like to know how to add the simplest OCR library, one that is easy to install and use, to a C# project, with detailed installation instructions.

If possible, I'd just like to use it like an ordinary DLL reference...

Example:

using org.pdfbox.pdmodel;
using org.pdfbox.util;

Also a little OCR code example would be nice, such as:

public string OCRFromBitmap(Bitmap Bmp)
{
    Bmp.Save(temppath, System.Drawing.Imaging.ImageFormat.Tiff);
    string OcrResult = Analyze(temppath);
    File.Delete(temppath);
    return OcrResult;
}

So please bear in mind that I'm not familiar with OCR projects, and give me an answer as if you were talking to a dummy.

Edit: I guess people misunderstood my request. I wanted to know how to integrate those open source OCR libraries into a C# project and how to use them. The link given as a duplicate does not answer what I asked at all.

MX D
Berker Yüceer

5 Answers

134

If anyone is looking into this, I've been trying different options and the following approach yields very good results. The following are the steps to get a working example:

  1. Add the .NET wrapper for Tesseract to your project. It can be installed from NuGet with Install-Package Tesseract (https://github.com/charlesw/tesseract).
  2. Go to the Downloads section of the official Tesseract project (https://code.google.com/p/tesseract-ocr/; EDIT: it's now located at https://github.com/tesseract-ocr/langdata, although commenters below report finding eng.traineddata at https://github.com/tesseract-ocr/tessdata instead).
  3. Download the preferred language data, for example tesseract-ocr-3.02.eng.tar.gz (English language data for Tesseract 3.02).
  4. Create a tessdata directory in your project and place the language data files in it.
  5. Open the Properties of the newly added files and set them to be copied to the output directory on build.
  6. Add a reference to System.Drawing.
  7. From the .NET wrapper repository, copy the sample phototest.tif file from the Samples directory into your project directory and set it to be copied on build as well.
  8. Create the following two files in your project (just to get started):

Program.cs

using System;
using Tesseract;
using System.Diagnostics;

namespace ConsoleApplication
{
    class Program
    {
        public static void Main(string[] args)
        {
            var testImagePath = "./phototest.tif";
            if (args.Length > 0)
            {
                testImagePath = args[0];
            }

            try
            {
                var logger = new FormattedConsoleLogger();
                var resultPrinter = new ResultPrinter(logger);
                using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
                {
                    using (var img = Pix.LoadFromFile(testImagePath))
                    {
                        using (logger.Begin("Process image"))
                        {
                            var i = 1;
                            using (var page = engine.Process(img))
                            {
                                var text = page.GetText();
                                logger.Log("Text: {0}", text);
                                logger.Log("Mean confidence: {0}", page.GetMeanConfidence());

                                using (var iter = page.GetIterator())
                                {
                                    iter.Begin();
                                    do
                                    {
                                        if (i % 2 == 0)
                                        {
                                            using (logger.Begin("Line {0}", i))
                                            {
                                                do
                                                {
                                                    using (logger.Begin("Word Iteration"))
                                                    {
                                                        if (iter.IsAtBeginningOf(PageIteratorLevel.Block))
                                                        {
                                                            logger.Log("New block");
                                                        }
                                                        if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
                                                        {
                                                            logger.Log("New paragraph");
                                                        }
                                                        if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
                                                        {
                                                            logger.Log("New line");
                                                        }
                                                        logger.Log("word: " + iter.GetText(PageIteratorLevel.Word));
                                                    }
                                                } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
                                            }
                                        }
                                        i++;
                                    } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
                                }
                            }
                        }
                    }
                }
            }
            catch (Exception e)
            {
                Trace.TraceError(e.ToString());
                Console.WriteLine("Unexpected Error: " + e.Message);
                Console.WriteLine("Details: ");
                Console.WriteLine(e.ToString());
            }
            Console.Write("Press any key to continue . . . ");
            Console.ReadKey(true);
        }



        private class ResultPrinter
        {
            readonly FormattedConsoleLogger logger;

            public ResultPrinter(FormattedConsoleLogger logger)
            {
                this.logger = logger;
            }

            public void Print(ResultIterator iter)
            {
                logger.Log("Is beginning of block: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Block));
                logger.Log("Is beginning of para: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Para));
                logger.Log("Is beginning of text line: {0}", iter.IsAtBeginningOf(PageIteratorLevel.TextLine));
                logger.Log("Is beginning of word: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Word));
                logger.Log("Is beginning of symbol: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Symbol));

                logger.Log("Block text: \"{0}\"", iter.GetText(PageIteratorLevel.Block));
                logger.Log("Para text: \"{0}\"", iter.GetText(PageIteratorLevel.Para));
                logger.Log("TextLine text: \"{0}\"", iter.GetText(PageIteratorLevel.TextLine));
                logger.Log("Word text: \"{0}\"", iter.GetText(PageIteratorLevel.Word));
                logger.Log("Symbol text: \"{0}\"", iter.GetText(PageIteratorLevel.Symbol));
            }
        }
    }
}

FormattedConsoleLogger.cs

using System;
using System.Collections.Generic;
using System.Text;
using Tesseract;

namespace ConsoleApplication
{
    public class FormattedConsoleLogger
    {
        const string Tab = "    ";
        private class Scope : DisposableBase
        {
            private int indentLevel;
            private string indent;
            private FormattedConsoleLogger container;

            public Scope(FormattedConsoleLogger container, int indentLevel)
            {
                this.container = container;
                this.indentLevel = indentLevel;
                StringBuilder indent = new StringBuilder();
                for (int i = 0; i < indentLevel; i++)
                {
                    indent.Append(Tab);
                }
                this.indent = indent.ToString();
            }

            public void Log(string format, object[] args)
            {
                var message = String.Format(format, args);
                StringBuilder indentedMessage = new StringBuilder(message.Length + indent.Length * 10);
                int i = 0;
                bool isNewLine = true;
                while (i < message.Length)
                {
                    // Check i + 1 before indexing so a trailing '\r' cannot read past the end.
                    if (i + 1 < message.Length && message[i] == '\r' && message[i + 1] == '\n')
                    {
                        indentedMessage.AppendLine();
                        isNewLine = true;
                        i += 2;
                    }
                    else if (message[i] == '\r' || message[i] == '\n')
                    {
                        indentedMessage.AppendLine();
                        isNewLine = true;
                        i++;
                    }
                    else
                    {
                        if (isNewLine)
                        {
                            indentedMessage.Append(indent);
                            isNewLine = false;
                        }
                        indentedMessage.Append(message[i]);
                        i++;
                    }
                }

                Console.WriteLine(indentedMessage.ToString());

            }

            public Scope Begin()
            {
                return new Scope(container, indentLevel + 1);
            }

            protected override void Dispose(bool disposing)
            {
                if (disposing)
                {
                    var scope = container.scopes.Pop();
                    if (scope != this)
                    {
                        throw new InvalidOperationException("Format scope removed out of order.");
                    }
                }
            }
        }

        private Stack<Scope> scopes = new Stack<Scope>();

        public IDisposable Begin(string title = "", params object[] args)
        {
            Log(title, args);
            Scope scope;
            if (scopes.Count == 0)
            {
                scope = new Scope(this, 1);
            }
            else
            {
                scope = ActiveScope.Begin();
            }
            scopes.Push(scope);
            return scope;
        }

        public void Log(string format, params object[] args)
        {
            if (scopes.Count > 0)
            {
                ActiveScope.Log(format, args);
            }
            else
            {
                Console.WriteLine(String.Format(format, args));
            }
        }

        private Scope ActiveScope
        {
            get
            {
                var top = scopes.Peek();
                if (top == null) throw new InvalidOperationException("No current scope");
                return top;
            }
        }
    }
}
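
For readers who only need the recognized text, the sample above can be cut down considerably. The following is a minimal sketch, not part of the original answer: it assumes the same Tesseract NuGet wrapper and a tessdata folder copied next to the executable (steps 4 and 5), and the SimpleOcr class and OcrFromBitmap method are hypothetical names chosen to mirror the signature asked for in the question. It uses only the wrapper calls that already appear in Program.cs.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using Tesseract;

public static class SimpleOcr
{
    // Hypothetical helper mirroring the OCRFromBitmap signature from the question.
    // Assumes ./tessdata (containing the English language data) sits next to the executable.
    public static string OcrFromBitmap(Bitmap bmp)
    {
        // Save the bitmap to a temporary TIFF so Pix.LoadFromFile can read it.
        string tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".tif");
        bmp.Save(tempPath, ImageFormat.Tiff);
        try
        {
            using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
            using (var img = Pix.LoadFromFile(tempPath))
            using (var page = engine.Process(img))
            {
                return page.GetText();
            }
        }
        finally
        {
            // Remove the temporary file even if recognition throws.
            File.Delete(tempPath);
        }
    }
}
```

It could be called as `string text = SimpleOcr.OcrFromBitmap(new Bitmap("phototest.tif"));`. The sketch creates a new engine on every call to stay short; when processing many images it is probably better to create the TesseractEngine once and reuse it.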
B.K.
  • 4
    I wish I could vote more than once because this is such a good instruction to get that thing running. – BloodyRain2k Dec 25 '15 at 19:06
  • 2
    @BloodyRain2k I'm glad that you found it useful. Thank you for the kind words. – B.K. Dec 25 '15 at 22:29
  • 2
    I used the link you mentioned above. In eng folder (https://github.com/tesseract-ocr/langdata/tree/master/eng) One file is missing i.e. eng.traineddata. Please add this file too. – Mughees Musaddiq Feb 26 '16 at 15:09
  • 2
    @MugheesMusaddiq They keep on changing the files a lot, that's why I was reluctant to put any links, as they're not guaranteed to be the same down the line. This is meant more as a guide on how to get started and the lack of link guarantee is why I've pasted so much code here. – B.K. Feb 27 '16 at 01:25
  • 1
    I found the eng.traineddata here (https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata). Plugged it in and it worked. ...if you aren't good with github just click the Raw button and it will download the file. –  May 17 '16 at 17:19
  • 2
    Old versions of the language data can be downloaded here: https://sourceforge.net/projects/tesseract-ocr-alt/files/ (e.g. because as of right now the NuGet package is version 3.02 and the only language data available on the site linked above is 3.04; alternatively, the Wayback Machine can be used) – mYnDstrEAm Aug 16 '16 at 07:35
  • This is a good answer and works perfectly - any idea if it's possible for Tesseract to read PDF documents? – blueprintchris Nov 07 '16 at 09:19
  • Where/How are you putting the tessearact language files into your project? Can you provide a working example that one can download? – Christine Apr 11 '17 at 18:10
  • @Hill Steps #4 and #5. – B.K. Apr 12 '17 at 02:03
  • 1
    Yeah, I did that... thanks. I actually put in a ticket to this issue for the creator on GitHub, and he told me that your link to the language files is incorrect. See https://github.com/charlesw/tesseract/issues/339 – Christine Apr 12 '17 at 15:51
  • @Hill Yeah, that's the issue with links, as I mention in an earlier comment -- especially when they're a few years old. – B.K. Apr 12 '17 at 16:34
  • 1
    Oh, I hate when I Google a problem, and the first answer I come to says "This has already been answered," with a link to a post that doesn't exist any more. – Christine Apr 12 '17 at 16:59
  • Could it be possible to provide a picture of the solution explorer after having imported the language files ? I currently have: `MySolution > MyProject > tessdata > desired_characters, fra.bad_words, ...` – Mat Dec 26 '17 at 19:16
  • link to language file for version 3.30 - https://github.com/tesseract-ocr/tessdata/tree/master – Ariwibawa Mar 15 '19 at 07:42
12

Here's one: (check out http://hongouru.blogspot.ie/2011/09/c-ocr-optical-character-recognition.html or http://www.codeproject.com/Articles/41709/How-To-Use-Office-2007-OCR-Using-C for more info)

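using System;
using MODI; // requires a COM reference to the Microsoft Office Document Imaging (MODI) library

class Program
{
    static void Main(string[] args)
    {
        DocumentClass myDoc = new DocumentClass();
        myDoc.Create(@"theDocumentName.tiff"); // we work with the .tiff extension
        myDoc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);

        // Qualify Image so it cannot be confused with System.Drawing.Image.
        foreach (MODI.Image anImage in myDoc.Images)
        {
            Console.WriteLine(anImage.Layout.Text); // write the recognized text to the console
        }
    }
}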
Rob P.
  • 1
    How do I get MODI? I do have Microsoft Office 2010 & 2013 installed. – mYnDstrEAm Aug 16 '16 at 07:33
  • I have MS Office, but the references can't be resolved (they have the yellow warning triangle), so the project won't build. – Ewan Mar 23 '17 at 13:49
7

I'm using tesseract OCR engine with TessNet2 (a C# wrapper - http://www.pixel-technology.com/freeware/tessnet2/).

Some basic code:

using tessnet2;

...

Bitmap image = new Bitmap(@"u:\user files\bwalker\2849257.tif");
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
ocr.SetVariable("tessedit_char_whitelist", "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.,$-/#&=()\"':?"); // Accepted characters
ocr.Init(@"C:\Users\bwalker\Documents\Visual Studio 2010\Projects\tessnetWinForms\tessnetWinForms\bin\Release\", "eng", false); // Directory of your tessdata folder
List<tessnet2.Word> result = ocr.DoOCR(image, System.Drawing.Rectangle.Empty);
string Results = "";
foreach (tessnet2.Word word in result)
{
    Results += word.Confidence + ", " + word.Text + ", " + word.Left + ", " + word.Top + ", " + word.Bottom + ", " + word.Right + "\n";
}
Ben Walker
  • 1
    In your link, there is another link "Download binary here" and it doesn't work. In fact this link is on many websites and it doesn't work on any of them. Does anyone know where the tessnet2.dll can be downloaded from? – Ewan Mar 22 '17 at 13:13
  • 2
    I actually found tessnet2 in NuGet, not sure why I didn't look there first. It stops on the ocr.Init line when I run it, though. Is there meant to be something specific in that directory? tessnet2_32.dll is in my "tessdata" folder, as is my application exe file. Any idea why it stops? It simply doesn't do anything. – Ewan Mar 23 '17 at 15:30
3

Some online APIs work pretty well: OCR.space and Google Cloud Vision. Both of these are free as long as you make fewer than 1,000 OCR requests per month. You can drag and drop an image to do a quick manual test and see how they perform on your images.

I find OCR.space easier to use (no messing around with NuGet libraries), but for my purposes Google Cloud Vision gave slightly better results than OCR.space.

Google Cloud Vision example:

GoogleCredential cred = GoogleCredential.FromJson(json);
Channel channel = new Channel(ImageAnnotatorClient.DefaultEndpoint.Host, ImageAnnotatorClient.DefaultEndpoint.Port, cred.ToChannelCredentials());
ImageAnnotatorClient client = ImageAnnotatorClient.Create(channel);
Image image = Image.FromStream(stream);

EntityAnnotation googleOcrText = client.DetectText(image).First();
Console.Write(googleOcrText.Description);

OCR.space example:

string uri = $"https://api.ocr.space/parse/imageurl?apikey=helloworld&url={imageUri}";
string responseString = WebUtilities.DoGetRequest(uri);
OcrSpaceResult result = JsonConvert.DeserializeObject<OcrSpaceResult>(responseString);
if ((!result.IsErroredOnProcessing) && !String.IsNullOrEmpty(result.ParsedResults[0].ParsedText))
  return result.ParsedResults[0].ParsedText;
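
The OCR.space snippet above depends on two pieces that are not shown, WebUtilities.DoGetRequest and OcrSpaceResult. The following is a rough sketch of what they might look like, assuming HttpClient for the GET request and Json.NET for deserialization. The class names OcrSpaceParsedResult and OcrSpaceClient are hypothetical, and the DTOs model only the three fields the snippet reads (IsErroredOnProcessing, ParsedResults, ParsedText); it is not the answerer's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json;

// Hypothetical DTOs: only the fields the snippet above reads.
public class OcrSpaceParsedResult
{
    public string ParsedText { get; set; }
}

public class OcrSpaceResult
{
    public bool IsErroredOnProcessing { get; set; }
    public List<OcrSpaceParsedResult> ParsedResults { get; set; }
}

public static class OcrSpaceClient
{
    private static readonly HttpClient Http = new HttpClient();

    // Returns the recognized text, or null when the service reports an error
    // or returns no text. "helloworld" is the API key used in the answer above.
    public static async Task<string> OcrFromUrlAsync(string imageUri, string apiKey = "helloworld")
    {
        // Escape both values so special characters in the image URL survive the query string.
        string uri = "https://api.ocr.space/parse/imageurl?apikey=" + Uri.EscapeDataString(apiKey)
                   + "&url=" + Uri.EscapeDataString(imageUri);
        string responseString = await Http.GetStringAsync(uri);
        OcrSpaceResult result = JsonConvert.DeserializeObject<OcrSpaceResult>(responseString);

        if (result == null || result.IsErroredOnProcessing
            || result.ParsedResults == null || result.ParsedResults.Count == 0)
        {
            return null;
        }
        string text = result.ParsedResults[0].ParsedText;
        return string.IsNullOrEmpty(text) ? null : text;
    }
}
```

A caller would use it as `string text = await OcrSpaceClient.OcrFromUrlAsync(imageUri);`. Checking ParsedResults for null or empty before indexing avoids an exception when the service returns an error payload without results.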
Jimmy
1

A new API is OcrEngine.RecognizeAsync from WinRT/UWP. It can also be used in WinForms:

...
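//for AsBuffer
using System.Runtime.InteropServices.WindowsRuntime;
using Windows.Graphics.Imaging;
using Windows.Media.Ocr;
using Windows.Storage;
using Windows.Storage.Streams;
...

    // Decode an image file straight into a SoftwareBitmap.
    private async Task<SoftwareBitmap> loadSoftwareBitmap(string fn)
    {
        // GetFileFromPathAsync expects an absolute path, so resolve relative names first.
        var sf = await StorageFile.GetFileFromPathAsync(System.IO.Path.GetFullPath(fn));
        SoftwareBitmap sb;
        using (IRandomAccessStream stream = await sf.OpenAsync(FileAccessMode.Read))
        {
            BitmapDecoder decoder = await BitmapDecoder.CreateAsync(stream);
            sb = await decoder.GetSoftwareBitmapAsync();
        }
        return sb;
    }

    private async void button5_Click(object sender, EventArgs e)
    {
        // Returns null when none of the user's profile languages is supported for OCR.
        OcrEngine ocrEngine = OcrEngine.TryCreateFromUserProfileLanguages();
        if (ocrEngine == null) return;

        var fn = @"1.png";

        var outputBitmap = await loadSoftwareBitmap(fn);

        // ocrResult.Text holds the recognized text.
        var ocrResult = await ocrEngine.RecognizeAsync(outputBitmap);
    }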

To use the WinRT/UWP API in WinForms, add the NuGet package "Microsoft.Windows.SDK.Contracts" (version 10.0.17134.100 for the Win10 1803 SDK was tested here) as described here.

Edit: The previous version was very slow; most of the time was spent converting between Image and SoftwareBitmap. The new version loads the SoftwareBitmap directly and is very fast. The results are very good, although it cannot recognize characters on curved paper (as far as I know, many other frameworks cannot do OCR on curved paper directly either).
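
The click handler above stops once RecognizeAsync returns, so here is a small sketch of one way to read the result. It assumes the same Microsoft.Windows.SDK.Contracts package. OcrResultFormatter and Describe are hypothetical names, while Lines, Words, Text and BoundingRect are members of the Windows.Media.Ocr result types.

```csharp
using System;
using System.Text;
using Windows.Media.Ocr;

public static class OcrResultFormatter
{
    // Hypothetical helper: flattens an OcrResult into one string,
    // one recognized line per row, followed by each word and its bounding box.
    public static string Describe(OcrResult ocrResult)
    {
        var sb = new StringBuilder();
        foreach (OcrLine line in ocrResult.Lines)
        {
            sb.AppendLine(line.Text);
            foreach (OcrWord word in line.Words)
            {
                var r = word.BoundingRect; // Windows.Foundation.Rect
                sb.AppendLine($"    {word.Text} @ ({r.X:F0}, {r.Y:F0}) {r.Width:F0}x{r.Height:F0}");
            }
        }
        return sb.ToString();
    }
}
```

In the handler it could be used as `textBox1.Text = OcrResultFormatter.Describe(ocrResult);`, where textBox1 is an assumed control on the form. If only the plain text is needed, ocrResult.Text is enough.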

jw_
  • Doesn't work for .Net Framework 4.8 – ThexBasic Nov 13 '22 at 21:30
  • @ThexBasic I've checked my test project and my real project; both are indeed .NET Framework 4.8 on VS2017. What exactly is the problem you encountered? – jw_ Nov 16 '22 at 00:24
  • Even if I add the nuget Package, the "using" is still missing & cannot resolve somehow. – ThexBasic Nov 17 '22 at 07:05
  • @ThexBasic Check your environment, and please use version 10.0.17134.100 (the oldest) of Microsoft.Windows.SDK.Contracts. When I try the latest version, it can't even be installed, which may be the cause of your problem. I'm on VS2017 15.9.6. I can create a new WinForms .NET Framework 4.8 app, install that version, copy the above code, and it works. Also add these: using System.Runtime.InteropServices.WindowsRuntime; using Windows.Media.Ocr; using Windows.Graphics.Imaging; – jw_ Nov 20 '22 at 11:50
  • From two tests, this works even better than Tesseract, but combined, it will be even better. The only caveat about this: cannot work outside Windows. – Master DJon Feb 20 '23 at 06:08