Tesseract: Advantage to Multi-Page Training File vs. Multiple Separate Files?

Question

This SO answer suggests that training Tesseract with .tif files has an advantage over .png files because a .tif file can hold multiple pages and thus a larger training sample. Yet this SO question discusses procedures for training with multiple images at once. Moreover, the man page for mftraining, for example, suggests that it can accept multiple training files.

Is there any reason then not to train with multiple separate image files?

score 2 · Accepted Answer · answered Jun 27 '16 at 10:39

Using multiple images to train Tesseract on a single font seems to work just fine. Below is a sketch of the workflow I use:

# Convert the PDF pages to .png images
convert -density 600 Page1.pdf eng1.MyNewFont.exp1.png
convert -density 600 Page2.pdf eng1.MyNewFont.exp2.png

# Create .box files
tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1 -l eng batch.nochop makebox
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2 -l eng batch.nochop makebox

## correct boxes with jTessBoxEditor or another box editor ##

# Create two new box.tr files: eng1.MyNewFont.exp1.box.tr and eng1.MyNewFont.exp2.box.tr

tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1.box -l eng1 nobatch box.train.stderr
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2.box -l eng1 nobatch box.train.stderr

# Extract characters from the two .box files
unicharset_extractor eng1.MyNewFont.exp1.box eng1.MyNewFont.exp2.box 

echo "MyNewFont 0 0 0 0 0" >> font_properties

# train using the two new box.tr files.
mftraining -F font_properties -U unicharset -O eng1.unicharset eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr 
cntraining eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr

## rename files
mv inttemp  eng1.inttemp
mv normproto  eng1.normproto
mv pffmtable  eng1.pffmtable
mv shapetable  eng1.shapetable

combine_tessdata eng1. ## create .traineddata file.
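
The same pattern should extend to any number of pages. The following is only a dry-run sketch: it builds the file lists from the naming scheme used above and prints the training commands rather than running them. The `pages` variable is my own addition for illustration, and it assumes the images are numbered exp1, exp2, and so on without gaps.

```shell
#!/bin/sh
# Dry-run sketch: print the training commands for N page images.
# Assumes files follow the <lang>.<font>.exp<N> naming used above.
lang=eng1
font=MyNewFont
pages=2        # hypothetical: number of page images to train on

box_files=""
tr_files=""
i=1
while [ "$i" -le "$pages" ]; do
  base="$lang.$font.exp$i"
  box_files="$box_files $base.box"       # input to unicharset_extractor
  tr_files="$tr_files $base.box.tr"      # input to mftraining and cntraining
  i=$((i + 1))
done

# Print the commands instead of executing them; remove "echo" to run for real.
echo "unicharset_extractor$box_files"
echo "mftraining -F font_properties -U unicharset -O $lang.unicharset$tr_files"
echo "cntraining$tr_files"
```

With `pages=2` this prints the same three commands as in the workflow above.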

score 0 · Answer 2 · answered Jun 25 '16 at 14:16

You can certainly train with multiple image files, but Tesseract would treat them as separate fonts, and there is a limit of 64 on the number of images. If the images share a font, it is better to put them in a single multi-page TIFF; per its specification, a TIFF file is a container that can hold many images.
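
If the pages already exist as separate images with corrected box files, one way to move to a multi-page TIFF is sketched below. It relies on two assumptions: that ImageMagick's `convert` writes a multi-page TIFF when given several input images, and that the last field of each box line is a 0-based page index. The `merge_boxes` helper and the `exp0` output name are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Sketch: merge per-page box files into one box file for a multi-page TIFF.
# Assumes box lines look like "<char> <left> <bottom> <right> <top> <page>".

# merge_boxes (hypothetical helper): concatenate the given box files in order,
# rewriting the 6th field of every line to the 0-based index of its page.
merge_boxes() {
  page=0
  for f in "$@"; do
    awk -v p="$page" '{ $6 = p; print }' "$f"
    page=$((page + 1))
  done
}

# Example usage (not run here; requires ImageMagick and the files to exist):
#   convert eng1.MyNewFont.exp1.png eng1.MyNewFont.exp2.png eng1.MyNewFont.exp0.tif
#   merge_boxes eng1.MyNewFont.exp1.box eng1.MyNewFont.exp2.box > eng1.MyNewFont.exp0.box
```

The page order given to `merge_boxes` has to match the image order given to `convert`, otherwise the boxes will point at the wrong pages.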

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
https://en.wikipedia.org/wiki/Tagged_Image_File_Format

nguyenq


Will Tesseract necessarily treat them as different fonts? I edited my question to give a workflow that I *think* uses two images to train a single font. Is there something flawed about it though? Thanks! – Michael Ohlrogge Jun 25 '16 at 14:44
I normally train with multi-page TIFF, but your workflow seems to be workable, except it appears to miss a couple steps (commands). – nguyenq Jun 26 '16 at 17:05
