Tesseract and tiff format - spp not in set {1,3}

Question

While trying to run this command:

tesseract bond111.tif bond111 batch.nochop makebox

I get the following error:

Error in pixReadFromTiffStream: spp not in set {1,3}
Error in pixReadStreamTiff: pix not read
Error in pixReadTiff: pix not read

Assuming that "spp not in set" is the main error here, what does it mean? At first Tesseract had trouble because the bpp was higher than 24, so I reduced it using Gimp, but that did not resolve the issue.

I see no reason for this question to be closed. The OP gives an explicit software command that they tried and the details on the error they received. Tesseract is a relatively active tag on SO and this is quite relevant to it. Many people (including myself) continue to find this page helpful. Working with Tesseract isn't the same as a lang like python, so questions will look a bit diff. But if Tesseract is accepted as a tag on SO then I see no reason why this question shouldn't be allowed. — Michael Ohlrogge, Jun 27 '16 at 17:20

score 46 · Accepted Answer · answered Apr 18 '12 at 12:33

It probably means your TIFF image has an alpha channel, which the underlying Leptonica library used by Tesseract doesn't support. If you're using ImageMagick, be aware that operations such as -draw can cause an alpha channel to be added. If you're using convert in your workflow and want to remove the channel again immediately, flatten the image before writing by adding -background white -flatten +matte before the output filename, e.g.:

convert input.tiff -fill white -draw 'rectangle 10,10 20,20' -background white -flatten +matte output.tiff

Tesseract (well, Leptonica) accepts PNGs these days and is less picky about them, so it might be easier to migrate your workflow to PNG anyway.
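
As a rough sketch of the PNG route, assuming a single-page input and that ImageMagick's convert and tesseract are both on the PATH, a small wrapper could look like this. The function name ocr_via_png is made up for illustration; any extra arguments are passed through to tesseract unchanged.

```shell
# Hypothetical helper (the name is illustrative, not part of either tool).
# Usage: ocr_via_png INPUT.tif OUTPUTBASE [extra tesseract arguments...]
# Assumes a single-page input; a multi-page TIFF would be split by convert
# into several numbered PNG files, which this sketch does not handle.
ocr_via_png() {
    ocr_in=$1
    ocr_base=$2
    shift 2
    ocr_png="${ocr_base}.png"
    # Re-encode the image as PNG; stop if the conversion fails.
    convert "$ocr_in" "$ocr_png" || return 1
    # Run OCR on the PNG, forwarding any remaining arguments.
    tesseract "$ocr_png" "$ocr_base" "$@"
}
```

For the command in the question this would be called as: ocr_via_png bond111.tif bond111 batch.nochop makebox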

Sources: magick-users mailing list posting; tesseract-ocr mailing list posting

answered Apr 18 '12 at 12:33

ZakW


Interesting. This solution works for me, but only generates the last page of the input pdf I have. – mlissner Feb 03 '13 at 22:19

Looks like the -flatten command reduces it to a single page. Removing that fixed everything for me. – mlissner Feb 03 '13 at 22:25

Note that `+matte` is deprecated according to the docs. Use `-alpha Off` instead. – fotNelton Mar 14 '14 at 05:32

@fotNelton Thanks: it's the -alpha Off that did it for me. -flatten is definitely not helpful for a multipage scan – Auspex Jul 24 '15 at 03:11
PNGs worked great for me. Thank you for the suggestion. – Wayne Conrad Sep 13 '16 at 01:16
So far the current tesseract does NOT accept PNG as a single appended file. Nor does tesseract accept a flattened tiff with the above options. – CPlusPlus OOA and D Apr 20 '18 at 15:31

score 19 · Answer 2 · edited Mar 31 '18 at 01:56

Thanks for your post, ZakW; you pointed me in the right direction. However, I also needed to set '-depth 8'. Even then, the quality was not good enough for OCR, whatever I tried.

What worked for me is this solution:

ghostscript -o document.tiff -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw document.pdf
tesseract document.tiff document -l deu
vim document.txt

This way I got perfect text with umlauts in German.
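
The -g value in the command above is a size in device pixels, so it has to match the -r resolution: 6120x7920 at 720 dpi is 8.5 x 11 inches (US Letter, 612 x 792 points). A minimal sketch of that arithmetic, for anyone using a different resolution or page size; gs_geometry is a hypothetical helper name, and it uses integer division, so results are truncated.

```shell
# Hypothetical helper: compute the WxH pixel size for Ghostscript's -g option.
# Usage: gs_geometry DPI WIDTH_PT HEIGHT_PT   (1 point = 1/72 inch)
gs_geometry() {
    gs_dpi=$1
    # pixels = points * dpi / 72, truncated to whole pixels
    echo "$(( $2 * gs_dpi / 72 ))x$(( $3 * gs_dpi / 72 ))"
}

gs_geometry 720 612 792   # prints 6120x7920, the value used above
```

At 300 dpi the same page would need -r300x300 -g2550x3300.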

edited Mar 31 '18 at 01:56

cs95

answered May 31 '12 at 21:10

Florian Lagg


This approach is the only one which has successfully converted the content to a text file. It makes no sense why the accepted answer is not working. – CPlusPlus OOA and D Apr 20 '18 at 19:41

score 6 · Answer 3 · answered Dec 17 '18 at 11:29

Changing the conversion command to the following helped me:

convert -density 300 input.pdf -depth 8 -background white -alpha Off output.tiff

Note that the other answers did not work for me since they use the deprecated +matte flag instead of -alpha Off.

answered Dec 17 '18 at 11:29

Alexander Belokon


score 5 · Answer 4 · answered Feb 19 '12 at 15:29

You can use the 'tiffinfo' command, provided by libtiff-tools, to check the TIFF format of your source image. TIFF files come in many variants, with different values for bits per pixel (bpp) and samples per pixel (spp).

Error in pixReadFromTiffStream: spp not in set {1,3,4}

An 'spp' value of 2 (typically grayscale plus an alpha channel) is not in the supported set, so Leptonica rejects the file.

I solved the problem by saving directly to TIFF format from Gimp, instead of converting from .png to .tif using ImageMagick's 'convert'.
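
The tiffinfo check can be scripted. The sketch below assumes tiffinfo reports the value on a line labelled "Samples/Pixel:" and treats 1 and 3 as the accepted values, matching the error message in the question; newer Leptonica versions may accept more, as the {1,3,4} variant of the message suggests. The function names are made up for illustration.

```shell
# Print the Samples/Pixel value of each page, as reported by tiffinfo.
tiff_spp() {
    tiffinfo "$1" 2>/dev/null |
        sed -n 's|^[[:space:]]*Samples/Pixel:[[:space:]]*\([0-9][0-9]*\).*|\1|p'
}

# Return 0 if every page has 1 or 3 samples per pixel,
# 1 if any page has another value, 2 if no value could be read.
tiff_spp_ok() {
    spp_values=$(tiff_spp "$1")
    [ -n "$spp_values" ] || return 2
    for spp in $spp_values; do
        case "$spp" in
            1|3) ;;             # grayscale/bilevel or RGB
            *) return 1 ;;      # e.g. 2 or 4: extra (alpha) sample present
        esac
    done
    return 0
}
```

For example: tiff_spp_ok bond111.tif || echo "unsupported samples per pixel: $(tiff_spp bond111.tif)"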
