How to create a TIFF file that's readable by tesseract OCR?

Question

I want to run Tesseract OCR over an image file to extract its text.
The problem seems to be that tesseract not only requires TIFF input, it also requires the TIFF file to be in a particular format.

With just a normal tiff file, I get:

root@toshiba:~/Desktop# tesseract crap.tif crap.txt
Tesseract Open Source OCR Engine
check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32
Segmentation fault

So far I have managed to find a workaround.
It consists of using GIMP, going to Image > Mode > Indexed, and selecting "Generate optimum palette" with "Maximum number of colors" set to 256.
Then I have to do one more trick before "Save As":
going to Layer > Transparency > Remove Alpha Channel, which removes transparency. (TIFF itself can store an alpha channel, but the extra channel appears to be what tesseract rejects.)

The problem now is that my input image is produced in C# and preprocessed with AForge.NET image analysis filters, so I cannot do this step by hand in GIMP.

I have also found a .NET port of LibTiff, and an example of how to write an image with color palette here:
http://bitmiracle.com/libtiff/help/create-tiff-with-palette-(color-map).aspx

But I don't know how to get the data from the source tiff (the one with the wrong palette) to the target tiff (with the correct palette format)...

score 2 · Accepted Answer · answered Jan 25 '12 at 22:50

I've heard that tesseract is fine with grayscale TIFFs.
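
The reason grayscale should work can be read off the error message in the question: it lists the supported bits per pixel, and bits per pixel is samples per pixel times bits per sample. A rough sketch of that arithmetic follows. The class and method names are made up for illustration, and the supported list is copied from the error output, not from tesseract's source.

```csharp
using System;

// Hypothetical helper for illustration; not part of tesseract or LibTiff.NET.
public static class TesseractDepth
{
    // Copied from the error message: "Only 1,2,4,5,6,8 bpp are supported".
    private static readonly int[] SupportedBpp = { 1, 2, 4, 5, 6, 8 };

    // Bits per pixel = SamplesPerPixel * BitsPerSample
    // (assuming every sample has the same size).
    public static int BitsPerPixel(int samplesPerPixel, int bitsPerSample)
    {
        return samplesPerPixel * bitsPerSample;
    }

    // True if the depth appears in the list from the error message.
    public static bool IsSupported(int bitsPerPixel)
    {
        return Array.IndexOf(SupportedBpp, bitsPerPixel) >= 0;
    }
}
```

An RGBA image has 4 samples of 8 bits each, so BitsPerPixel(4, 8) is 32, the value tesseract complains about, and IsSupported(32) is false. An 8-bit grayscale image has 1 sample of 8 bits, so it is 8 bpp, which is on the list.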

So please try the following code to convert your TIFF images to grayscale:

// Requires LibTiff.NET; put "using BitMiracle.LibTiff.Classic;" at the top of the file.
using (Tiff tif = Tiff.Open(@"input.tif", "r"))
{
    FieldValue[] value = tif.GetField(TiffTag.IMAGEWIDTH);
    int width = value[0].ToInt();

    value = tif.GetField(TiffTag.IMAGELENGTH);
    int height = value[0].ToInt();

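    // Resolution tags are stored as rationals, so read them as doubles
    // instead of truncating to int. -1 marks "tag not present".
    double xresolution = -1;
    value = tif.GetField(TiffTag.XRESOLUTION);
    if (value != null)
        xresolution = value[0].ToDouble();

    double yresolution = -1;
    value = tif.GetField(TiffTag.YRESOLUTION);
    if (value != null)
        yresolution = value[0].ToDouble();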

    int[] raster = new int[height * width];
    if (!tif.ReadRGBAImageOriented(width, height, raster, Orientation.TOPLEFT))
    {
        System.Windows.Forms.MessageBox.Show("Could not read image");
        return;
    }

    string fileName = "grayscale.tif";
    using (Tiff output = Tiff.Open(fileName, "w"))
    {
        output.SetField(TiffTag.IMAGEWIDTH, width);
        output.SetField(TiffTag.IMAGELENGTH, height);
        output.SetField(TiffTag.ROWSPERSTRIP, 1);
        output.SetField(TiffTag.SAMPLESPERPIXEL, 1);
        output.SetField(TiffTag.BITSPERSAMPLE, 8);
        output.SetField(TiffTag.PLANARCONFIG, PlanarConfig.CONTIG);
        output.SetField(TiffTag.COMPRESSION, Compression.LZW);
        output.SetField(TiffTag.FILLORDER, FillOrder.MSB2LSB);
        output.SetField(TiffTag.PHOTOMETRIC, Photometric.MINISBLACK);

        if (xresolution != -1 && yresolution != -1)
        {
            output.SetField(TiffTag.XRESOLUTION, xresolution);
            output.SetField(TiffTag.YRESOLUTION, yresolution);
        }

        byte[] samples = new byte[width];
        for (int y = 0, index = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int rgb = raster[index++];

                // compute pixel brightness taking human eye's sensitivity
                // to each of red, green and blue colors into account
                byte gray = (byte)(Tiff.GetR(rgb) * 0.299 + Tiff.GetG(rgb) * 0.587 + Tiff.GetB(rgb) * 0.114);

                // Alternative formulas for RGB -> Gray conversion

                //byte gray = (byte)(Tiff.GetR(rgb) * 0.2125 + Tiff.GetG(rgb) * 0.7154 + Tiff.GetB(rgb) * 0.0721);
                //byte gray = (byte)((Tiff.GetR(rgb) + Tiff.GetG(rgb) + Tiff.GetB(rgb)) / 3);

                samples[x] = gray;
            }

            output.WriteEncodedStrip(y, samples, samples.Length);
        }
    }
}

Hopefully, it will do the trick.
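
In case it helps to see the per-pixel step in isolation, here is a small sketch with no library dependency. It assumes the packing that ReadRGBAImageOriented produces (red in the lowest byte, then green, then blue, with alpha in the highest byte), which is what Tiff.GetR, Tiff.GetG and Tiff.GetB unpack. The class and method names are invented for the example. Unlike the plain (byte) cast in the code above, which truncates, this version rounds to the nearest gray level.

```csharp
using System;

// Hypothetical helper for illustration; names are not from LibTiff.NET.
public static class GrayConversion
{
    // Unpack channels from a pixel packed as A-B-G-R (alpha high, red low).
    public static int R(int abgr) { return abgr & 0xFF; }
    public static int G(int abgr) { return (abgr >> 8) & 0xFF; }
    public static int B(int abgr) { return (abgr >> 16) & 0xFF; }

    // Weighted sum with the BT.601 luma weights used in the answer.
    public static byte Gray601(int abgr)
    {
        double y = R(abgr) * 0.299 + G(abgr) * 0.587 + B(abgr) * 0.114;
        return (byte)Math.Round(y);
    }

    // Weights from the first commented-out alternative in the answer.
    public static byte GrayAlt(int abgr)
    {
        double y = R(abgr) * 0.2125 + G(abgr) * 0.7154 + B(abgr) * 0.0721;
        return (byte)Math.Round(y);
    }

    // Plain average of the three channels (integer division).
    public static byte GrayAverage(int abgr)
    {
        return (byte)((R(abgr) + G(abgr) + B(abgr)) / 3);
    }
}
```

For an opaque pure green pixel, unchecked((int)0xFF00FF00), Gray601 gives 150, GrayAlt gives 182 and GrayAverage gives 85, which shows how differently the three formulas weight green. Which one reads best for OCR probably depends on the colors in your scans, so it may be worth trying more than one.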

score 2 · Answer 2 · answered Mar 25 '12 at 10:54

I had the same problem with Tesseract, but thanks to your advice, I just used GIMP to change the .tif from a color file into greyscale. This is easily done via Image > Mode > Grayscale and then saving as a TIFF again. Hope this helps someone who doesn't want to use the command line to fix the image problem.
