I'm having trouble with vertical text mixed with horizontal numbers.

For example:

[image: Text]

If the number were a single digit, recognition would succeed, but Tesseract tries to read this number as a single character because it expects characters to be stacked vertically. As far as I know, Tesseract reports a confidence value for the whole sentence rather than for each character. Is there a way to detect low confidence for just this character and then try a different approach on it, so that the numbers are parsed correctly?

Flux
K41F4r
  • Are you using jpn_vert.traineddata? – user3169 Mar 02 '19 at 07:08
  • Yes, I should have mentioned that. It reads the "24" part as one character. – K41F4r Mar 02 '19 at 10:22
  • I don't know about the confidence part, but you might address that as a separate question. Like "high confidence" do A, "low confidence" do B, rather than focus on a specific example. – user3169 Mar 04 '19 at 03:30
  • You might also look into page segmentation, to separate the numbers from the kanji. Something like in [How do I segment a document using Tesseract then output the resulting bounding boxes and labels](https://stackoverflow.com/questions/28591117/how-do-i-segment-a-document-using-tesseract-then-output-the-resulting-bounding-b), though it's far beyond anything I've done. – user3169 Mar 04 '19 at 03:37
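
For what it's worth, the "low confidence, do B" idea from the comments could be sketched roughly as below. This is only a sketch under assumptions, not a tested fix for this image: it assumes the `pytesseract` wrapper and Pillow are installed, that `--psm 5` (a single block of vertically aligned text) suits the page, and that any low-confidence region is a horizontal number that reads better with the `eng` model and `--psm 7`. The function names and the threshold of 60 are made up for illustration. Note that `image_to_data` reports confidence per recognized word, not per character; Tesseract's C++ `ResultIterator` can report symbol-level confidence, but that is not used here.

```python
from typing import Any, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), the order PIL's crop() expects


def low_confidence_boxes(data: Dict[str, List[Any]], threshold: float = 60.0) -> List[Box]:
    """Return bounding boxes of recognized words whose confidence is below threshold.

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    boxes = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as str, int, or float
        if conf < 0 or not str(text).strip():
            continue  # conf of -1 marks layout rows (page, block, line), not words
        if conf < threshold:
            left, top = data["left"][i], data["top"][i]
            boxes.append((left, top, left + data["width"][i], top + data["height"][i]))
    return boxes


def ocr_with_retry(path: str, threshold: float = 60.0) -> List[Tuple[Box, str]]:
    """OCR the page as vertical text, then re-read low-confidence regions horizontally."""
    # Imported lazily so low_confidence_boxes can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    image = Image.open(path)
    data = pytesseract.image_to_data(
        image,
        lang="jpn_vert",
        config="--psm 5",  # assumption: one uniform block of vertical text
        output_type=pytesseract.Output.DICT,
    )
    results = []
    for box in low_confidence_boxes(data, threshold):
        # Assumption: the weak region is a horizontal number, so treat the
        # crop as a single line of text and read it with the English model.
        text = pytesseract.image_to_string(image.crop(box), lang="eng", config="--psm 7")
        results.append((box, text.strip()))
    return results
```

Calling `ocr_with_retry("page.png")` (a hypothetical file name) would return each low-confidence box together with the second-pass reading, which could then be substituted for the original result. Whether the number region actually comes back with low confidence depends on the image, so the threshold would need tuning.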

0 Answers