
I am using Tesseract version 3.0.2.0, and my code is below:

string tessDataDir = @"D:\temp";
string ocrOutput = "";

// imagePath is the path to the input image (defined elsewhere).
using (var engine = new TesseractEngine(tessDataDir, "eng", EngineMode.Default))
{
    // Treat the whole image as a single character.
    engine.DefaultPageSegMode = PageSegMode.SingleChar;
    using (var image = Pix.LoadFromFile(imagePath))
    {
        using (var page = engine.Process(image))
        {
            ocrOutput = page.GetText();
        }
    }
}

I am getting a lot of incorrect characters. For example, "X" is sometimes detected as "J" and sometimes as "fi".

1) The JPEG image below is detected as "L" although it is an "X". Can anyone tell me why?

enter image description here

2) Also, how can I disable dictionary use in Tesseract? Thanks.

Sujit Singh
  • Tesseract needs some tweaking to work properly. You should try some image processing operations to clean up the letters in the image. For example, in the image you posted, if you can get rid of the black line at the bottom, it will recognize the letter X. Alternatively, you could try other parameters such as `--psm 13`, or you could limit the set of characters with a whitelist. – sinecode Mar 29 '18 at 20:03
  • I did a lot of pre-processing, but nothing helped. I even tried removing the border, converting to grayscale, etc. I couldn't find a PSM 13 option in the .NET Tesseract wrapper; see the screenshot at https://1drv.ms/u/s!Ase2Dy0lrWBLnZhdHXiKgDLvXx9hYQ Any idea? Thanks. – Sujit Singh Mar 30 '18 at 01:59
  • I don't know Tesseract for C#, but you're right, it seems that it doesn't have the `--psm 13` option. Do you already know the set of characters that you have to process? As I said before, you could limit the set of characters to look for with a whitelist; see [this question](https://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for). – sinecode Mar 30 '18 at 07:56
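
The whitelist suggestion from the comments could look roughly like the sketch below in the .NET wrapper. This is only an illustration under a few assumptions: the `SingleCharOcr` class and `ReadSingleChar` method are hypothetical names, the whitelist of uppercase Latin letters is an assumed character set, and `tessedit_char_whitelist`, `load_system_dawg` and `load_freq_dawg` are standard Tesseract parameters passed through `SetVariable`. The two dictionary parameters may only take effect when supplied at engine initialisation, so whether setting them after construction changes anything should be verified rather than assumed.

```csharp
using System;
using Tesseract;

static class SingleCharOcr
{
    // Hypothetical helper: recognises a single character in the image at imagePath.
    public static string ReadSingleChar(string tessDataDir, string imagePath)
    {
        using (var engine = new TesseractEngine(tessDataDir, "eng", EngineMode.Default))
        {
            // Treat the whole image as one character, as in the question.
            engine.DefaultPageSegMode = PageSegMode.SingleChar;

            // Assumed character set: restrict results to uppercase letters,
            // so ligatures such as "fi" cannot be returned.
            engine.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ");

            // Attempt to switch off the word dictionaries. Assumption: these may
            // be init-time parameters and could be ignored when set here.
            engine.SetVariable("load_system_dawg", "false");
            engine.SetVariable("load_freq_dawg", "false");

            using (var image = Pix.LoadFromFile(imagePath))
            using (var page = engine.Process(image))
            {
                // GetText() usually ends with newline characters; trim them.
                return page.GetText().Trim();
            }
        }
    }
}
```

A call such as `SingleCharOcr.ReadSingleChar(@"D:\temp", imagePath)` would then return the recognised character. The whitelist only narrows the candidates, so cleaning the image (for example, removing the black line mentioned above) is likely still needed.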
