
I'm trying to train Tesseract 4 with images instead of fonts.

The docs only explain the font-based approach, not training from images.

I know how this works in earlier versions of Tesseract, but I don't understand how to use the box/tiff files to train the LSTM engine in Tesseract 4.

I looked into tesstrain.sh, which is used to generate LSTM training data, but couldn't find anything helpful. Any ideas?

Mariana
claim

1 Answer


Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.

You’ll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. Its models act as the starting point for your training. Training from scratch takes hundreds of thousands of samples to reach good accuracy, so starting from an existing model lets you fine-tune with much less data (tens to hundreds of samples can be enough).

Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth.

Your training samples should be image/text file pairs that share the same name but different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar and you should have a text file named 001.gt.txt that has the text foobar.

Each pair must cover a single line of text: the image shows one line, and the text file contains that same line.
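Before training, it can help to confirm that every image has a matching transcription. The following is a minimal sketch, not part of tesstrain: `check_ground_truth` is a hypothetical helper name, and it assumes PNG images. It reports images with no .gt.txt file and .gt.txt files that do not contain exactly one non-empty line.

```shell
# Hypothetical helper (not part of tesstrain): check image/text pairs.
# Assumes PNG images; change the extension if you use .tif files.
check_ground_truth() {
    cg_dir="$1"
    cg_problems=0
    for cg_img in "$cg_dir"/*.png; do
        # The glob stays unexpanded when the directory has no PNG files.
        [ -e "$cg_img" ] || continue
        cg_gt="${cg_img%.png}.gt.txt"
        if [ ! -f "$cg_gt" ]; then
            echo "missing transcription: $cg_gt"
            cg_problems=$((cg_problems + 1))
            continue
        fi
        # Count non-empty lines; exactly one is expected per file.
        cg_lines=$(grep -c . "$cg_gt" || true)
        if [ "$cg_lines" -ne 1 ]; then
            echo "expected 1 line of text, found $cg_lines: $cg_gt"
            cg_problems=$((cg_problems + 1))
        fi
    done
    # Succeed only if no problems were found.
    [ "$cg_problems" -eq 0 ]
}
```

From inside the tesstrain repo you could call it as `check_ground_truth data/my-custom-model-ground-truth`. It prints one line per problem and returns a non-zero status if it found any.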

In the tesstrain repo, run this command:

make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best

Once the training is complete, there will be a new file tesstrain/data/.traineddata. Copy that file to the directory Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.

Then, you can run tesseract and use that model as a language.

tesseract -l my-custom-model foo.png -
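The copy step and the language name can be tied together in a small script. This is only a sketch under stated assumptions: `install_model` is a hypothetical helper name, and the destination directory is whatever tessdata directory your Tesseract installation searches. It relies on the fact that the value passed to `-l` is the model's file name without the .traineddata extension.

```shell
# Hypothetical helper: copy a trained model into a tessdata directory
# and print the name to pass to `tesseract -l`.
install_model() {
    im_model="$1"   # path to a .traineddata file
    im_dest="$2"    # tessdata directory, e.g. /usr/local/share/tessdata

    case "$im_model" in
        *.traineddata) ;;
        *) echo "not a .traineddata file: $im_model" >&2; return 1 ;;
    esac
    [ -f "$im_model" ] || { echo "no such file: $im_model" >&2; return 1; }
    [ -d "$im_dest" ] || { echo "no such directory: $im_dest" >&2; return 1; }

    cp "$im_model" "$im_dest/" || return 1
    # The language name is the file name without its extension.
    basename "$im_model" .traineddata
}
```

For example, `lang=$(install_model path/to/my-custom-model.traineddata /usr/local/share/tessdata)` followed by `tesseract -l "$lang" foo.png -`, where `path/to/` stands in for wherever your trained file ended up.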

Eric Ihli
  • Hey, thanks for this answer. A question: I have about 200 PNGs for each letter of the alphabet, so should I create the text files as `a_1.gt.txt`, `a_2.gt.txt`, etc. with content "a", with images `a_1.png`, `a_2.png`, etc.? – Akshay Jul 11 '20 at 07:33
  • The file name can be anything; all that matters is that the .gt.txt file and the .png file share the same name. a_1.gt.txt, a_1.png, a_2.gt.txt, a_2.png is correct. – Eliyaz KL Jul 12 '20 at 11:14
  • Thanks for the answer! Does this allow adding new symbols? Is there a tool that helps create those image/text files, for example by taking a page image and generating the line images and first guesses of the text? – Ant6n Dec 31 '20 at 14:52
  • I don't know of anything that turns a page into images of lines. But if you get that far, a tip for quickly reviewing and cleaning the first guesses is a program called `feh`. `feh` lets you view an image and a caption at the same time and lets you edit the caption from within feh. https://github.com/eihli/image-table-ocr/blob/49205462a3fb68240fd6a3d441ae7cf979b43daa/pdf_table_extraction_and_ocr.org#training-tips – Eric Ihli Dec 31 '20 at 21:04
  • `hocr-extract-images` from "hocr-tools" will convert a `.hocr` file (generated by Tesseract) plus the image to a set of line images/text pairs. – Inductiveload Jul 26 '21 at 03:59