
I'm using Tesseract with Python to attempt to read license plates using the function image_to_string(). The license plates include only uppercase alphas and digits. Occasionally, Tesseract misreads digits or uppercase characters as lowercase characters.

I know that I can specify a whitelist of characters that includes only uppercase letters and digits. What I really want to know is how the whitelist works: does it make the OCR algorithm skip the non-whitelisted characters and match each symbol only against whitelisted ones, or does it simply make image_to_string() discard any recognized characters that are not on the whitelist?
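One way to probe the difference empirically is sketched below. It assumes the pytesseract wrapper; `filter_to_whitelist` and `whitelist_changes_recognition` are hypothetical helper names. If the whitelisted result differs from the plain result with non-whitelisted characters stripped, the whitelist influenced recognition for that image. If the two match, the test is inconclusive for that image.

```python
import string

# Characters that can appear on the plates: uppercase letters and digits.
WHITELIST = string.ascii_uppercase + string.digits


def filter_to_whitelist(text, allowed=WHITELIST):
    """Emulate pure post-filtering: drop every character not in `allowed`."""
    return "".join(ch for ch in text if ch in allowed)


def whitelist_changes_recognition(image, allowed=WHITELIST):
    """Return True if whitelisted OCR differs from post-filtered plain OCR."""
    # Lazy import: needs pytesseract and the Tesseract binary installed.
    import pytesseract

    plain = pytesseract.image_to_string(image)
    constrained = pytesseract.image_to_string(
        image, config=f"-c tessedit_char_whitelist={allowed}"
    )
    # Whitespace and newlines are removed on both sides before comparing.
    return filter_to_whitelist(constrained) != filter_to_whitelist(plain)
```

Running this over a set of plates that are known to be misread would show whether the two behaviours can be told apart in practice.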

Zizumara

1 Answer


You can pass this option in the config parameter of image_to_string():

-c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
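As a minimal sketch of how that option might be passed from Python, assuming the pytesseract wrapper and Pillow are installed (`build_whitelist_config` and `read_plate` are hypothetical helper names, not part of any library):

```python
import string

# Characters that can appear on the plates: uppercase letters and digits.
PLATE_CHARS = string.ascii_uppercase + string.digits


def build_whitelist_config(chars=PLATE_CHARS):
    """Build a Tesseract config string that whitelists only `chars`."""
    return f"-c tessedit_char_whitelist={chars}"


def read_plate(image_path):
    """OCR an image file with the whitelist applied."""
    # Lazy imports: need pytesseract, Pillow, and the Tesseract binary.
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(
        Image.open(image_path), config=build_whitelist_config()
    )
    return text.strip()
```

Calling `read_plate("plate.png")` (a placeholder path) returns the recognized text with surrounding whitespace removed.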
the Tin Man
  • Please explain why `-c` is the appropriate solution. Also, it's important to format for readability. – the Tin Man Jun 09 '23 at 23:17
  • I already know how to apply a whitelist. The question is in regards to what the whitelist actually does. Does it simply exclude lower case characters from the result, or does it remove lower case characters from the potential matches? The distinction is important, because simply dropping lower case characters from the result doesn't improve the accuracy of the algorithm. – Zizumara Jun 10 '23 at 17:34
  • Yes, it excludes lowercase. If the text in your image contains XYZxyz and you apply -c tessedit_char_whitelist="XYZ", it takes only the capitals, not the lowercase xyz. – Ronak Patel Jun 19 '23 at 08:48