Tesseract OCR German Special Characters

Question

I am using Tesseract OCR to read German PNG images in C++, and I have problems with some special characters such as

ß ä ö ü and so on.

Do I need to train Tesseract to read these correctly, or what else needs to be done?

This is the part of the original image read by tesseract

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

UPDATE

SetConsoleOutputCP(1252);//changed to german.
SetConsoleCP(1252);//changed to german
wcout << "ÄÖÜ?ß" << endl;

// Open input image with leptonica library
Pix *image = pixRead("D:\\Images\\Document.png");
api->Init("D:\\TesseractBeispiele\\Tessaractbeispiel\\Tessaractbeispiel\\tessdata", "deu");
api->SetImage(image);
api->SetVariable("save_blob_choices", "T");
api->SetRectangle(1000, 3000, 9000, 9000);
api->Recognize(NULL);

// Get OCR result
wcout << api->GetUTF8Text();

After the change shown below UPDATE, the hard-coded umlauts are displayed correctly, but the text read from the image is still not correct. What do I need to change?

The Tesseract version is 3.0.2 and the Leptonica version is 1.68.

score 1 · Accepted Answer · edited May 23 '17 at 11:52


Tesseract can recognize Unicode characters and returns its result as UTF-8. Your console may not have been configured to display them.

What encoding/code page is cmd.exe using?

Unicode characters in Windows command line - how?
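To make the mismatch concrete: the text returned by GetUTF8Text() encodes ß as the two bytes C3 9F, while a console on code page 1252 expects the single byte DF. Streaming a char* into wcout also widens every byte on its own, so multi-byte sequences get split. Below is a minimal sketch of one way around this, assuming the console stays on code page 1252 as in the question. utf8ToLatin1 is a hypothetical helper, not part of Tesseract, and it only handles characters up to U+00FF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Tesseract function): converts UTF-8 text to
// single-byte Latin-1, which is identical to code page 1252 for
// U+00A0..U+00FF (this range includes ä ö ü Ä Ö Ü ß).
// Every character outside ASCII and that range is replaced by '?'.
std::string utf8ToLatin1(const std::string &utf8)
{
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        const bool hasNext = i + 1 < utf8.size();
        const unsigned char c2 =
            hasNext ? static_cast<unsigned char>(utf8[i + 1]) : 0;

        if (c < 0x80) {
            // Plain ASCII is copied unchanged.
            out += static_cast<char>(c);
        } else if ((c == 0xC2 || c == 0xC3) && hasNext && (c2 & 0xC0) == 0x80) {
            // Two-byte sequence that encodes U+0080..U+00FF.
            const unsigned char cp =
                static_cast<unsigned char>(((c & 0x03) << 6) | (c2 & 0x3F));
            out += (cp >= 0xA0) ? static_cast<char>(cp) : '?';
            ++i;
        } else {
            // Unsupported character: emit one '?' and skip its
            // continuation bytes (10xxxxxx).
            out += '?';
            while (i + 1 < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}
```

With the console set to 1252, the OCR result could then be printed through the narrow stream, for example `std::cout << utf8ToLatin1(text);` where `text` holds the string returned by `GetUTF8Text()`. On Windows the more general route is the pair `MultiByteToWideChar` / `WideCharToMultiByte`, which covers every code page rather than just this range.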

answered Apr 08 '16 at 13:22 by nguyenq · edited May 23 '17 at 11:52 by Community

The console almost certainly isn't configured for UTF-8. – MSalters Apr 08 '16 at 13:49
How would you configure the console for UTF-8? – Cazzador Apr 08 '16 at 15:08
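One possible way to do this from the program itself, as a rough sketch rather than a confirmed fix for this setup: switch the console output code page to UTF-8 with SetConsoleOutputCP(CP_UTF8) and write the bytes through the narrow stream instead of wcout. The helper names enableUtf8Console and printUtf8 are made up for illustration. It is assumed that the console uses a TrueType font such as Consolas or Lucida Console, since raster fonts may not show the characters.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: switches the Windows console output code page to
// UTF-8 (65001). On other platforms it does nothing and reports success,
// assuming the terminal already expects UTF-8.
bool enableUtf8Console()
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}

// Hypothetical helper: writes UTF-8 bytes unchanged to stdout through the
// narrow C stream, so no per-character widening takes place.
// Returns the number of bytes written.
std::size_t printUtf8(const char *utf8)
{
    return std::fwrite(utf8, 1, std::strlen(utf8), stdout);
}
```

In the code from the question this would mean calling `enableUtf8Console()` instead of `SetConsoleOutputCP(1252)`, then `char *text = api->GetUTF8Text(); printUtf8(text); delete[] text;`.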

score 0 · Answer 2 · answered Jun 24 '16 at 07:40

I don't know how to detect German words in an image in a Windows environment, but I know how to do it in a Linux environment. The following code may give you some idea.

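/*
 * word_OCR.cpp
 *
 *  Created on: Jun 23, 2016
 *      Author: root
 */

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <cstdio>
#include <iostream>

using namespace std;

int main(int argc, char **argv)
{
    if (argc < 2) {
        cerr << "Usage: " << argv[0] << " <image>\n";
        return 1;
    }

    // Open the input image with the Leptonica library.
    Pix *image = pixRead(argv[1]);

    if (image == 0) {
        cerr << "Cannot load input file!\n";
        return 1;   // do not continue with a null image
    }

    tesseract::TessBaseAPI tess;
    // Instead of passing "eng", pass "deu" to load the German language data.
    if (tess.Init("/usr/share/tesseract/tessdata", "deu")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        pixDestroy(&image);
        return 1;
    }

    tess.SetImage(image);
    tess.Recognize(0);

    tesseract::ResultIterator *ri = tess.GetIterator();
    tesseract::PageIteratorLevel level = tesseract::RIL_WORD;

    if (ri != 0)
    {
        do {
            // GetUTF8Text returns a new[]-allocated UTF-8 string (or NULL).
            const char *word = ri->GetUTF8Text(level);

            if (word != 0) {
                cout << word << endl;
                delete[] word;
            }

        } while (ri->Next(level));

        delete ri;   // a single object, so delete rather than delete[]
    }

    // Release the engine and the image.
    tess.End();
    pixDestroy(&image);
    return 0;
}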
One thing to take care of: pass an image with good resolution. Only then does it work well.

If you want more accuracy than this, you can load an Otsu-thresholded image with pixRead(). Right now I am passing a normal image to pixRead(). I developed an algorithm for the thresholding; let me know if anybody wants it. – pratik solanki, Jun 24 '16 at 07:43
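The commenter's own algorithm is not shown in the thread. For readers who want to try the idea, here is a minimal sketch of the textbook Otsu method on a raw 8-bit grayscale buffer. The function names otsuThreshold and binarize and the flat pixel-vector layout are assumptions made for this example. Leptonica also ships its own Otsu routines (for example pixOtsuAdaptiveThreshold), which may be the more practical choice when the image is already a Pix.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the Otsu threshold t (0..255) for 8-bit grayscale pixels.
// Pixels <= t form one class, pixels > t the other; t is the first value
// that maximizes the between-class variance.
int otsuThreshold(const std::vector<std::uint8_t> &gray)
{
    if (gray.empty()) return 0;

    // Histogram of gray levels.
    std::vector<double> hist(256, 0.0);
    for (std::uint8_t v : gray) hist[v] += 1.0;

    const double total = static_cast<double>(gray.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += i * hist[i];

    double weightLow = 0.0;   // number of pixels with value <= t
    double sumLow = 0.0;      // sum of those pixel values
    double bestVar = -1.0;
    int best = 0;

    for (int t = 0; t < 256; ++t) {
        weightLow += hist[t];
        sumLow += t * hist[t];
        if (weightLow == 0.0) continue;      // lower class still empty
        const double weightHigh = total - weightLow;
        if (weightHigh == 0.0) break;        // upper class empty from here on

        const double meanLow = sumLow / weightLow;
        const double meanHigh = (sumAll - sumLow) / weightHigh;
        const double diff = meanLow - meanHigh;
        const double betweenVar = weightLow * weightHigh * diff * diff;

        if (betweenVar > bestVar) {
            bestVar = betweenVar;
            best = t;
        }
    }
    return best;
}

// Maps every pixel to 0 (value <= threshold) or 255 (value > threshold).
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t> &gray,
                                   int threshold)
{
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i) {
        out[i] = gray[i] > threshold ? 255 : 0;
    }
    return out;
}
```

The binarized buffer would still have to be written to a file or wrapped in a Pix before Tesseract can read it; that step is left out here.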
