This is more of an algorithmy question - I am not very mathematical so was looking for an engineery solution... If this is off topic for SO let me know and I will delete the question.

I created a mashup of open source goodness to do Optical Character Recognition on difficult backgrounds: https://github.com/metalaureate/tesseract-docker-ocr

I want to use it to scan labels with a pre-defined ID code, e.g., 2826672. The accuracy is about 70% for digits.

Question: how do I add redundancy programmatically to my code to increase accuracy to 99%, and how do I decode it? I can imagine some really kludgy ways, like doubling and inverting the digits, but I don't know how to do this in a way that honors information theory without my having to translate a lot of math.

How do I add and decode digits to correct for OCR errors?

metalaureate
  • Sorry, the question is not very clear to me. Do you control the labels? Are you asking how to pick labels such that OCR errors can be corrected for? – Daniel Darabos Feb 04 '15 at 14:58
  • Yes and yes. The labels are actually on T-shirts in a glyphic code of 7 different symbols that I have trained Google's tesseract OCR engine to detect, equivalent to expressing a number in base 6. I want to know how do I add digits to correct for OCR errors? – metalaureate Feb 04 '15 at 15:06
  • Some inspiration [here](http://stackoverflow.com/questions/1100730/what-algorithm-to-use-to-calculate-a-check-digit). – 500 - Internal Server Error Feb 04 '15 at 15:54
  • You need to look into error correcting codes then (http://en.wikipedia.org/wiki/Forward_error_correction). There are quite a few libraries for these, so you don't have to implement (or understand) them. Encode your data, print the encoded numbers on the shirts, then decode the possibly corrupted scans to get back the original data. The codes can be tweaked to protect against more or less errors. – Daniel Darabos Feb 04 '15 at 15:58
  • Sweet https://www.npmjs.com/package/fec-stream - please post as answer – metalaureate Feb 04 '15 at 16:14
  • http://stackoverflow.com/questions/7068398/error-recovery-algorithms Looks like Hamming code is the way to go. You can then base64-encode the data and print it. – sharptooth Feb 05 '15 at 13:52
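
To make the encode/print/decode workflow from the comments concrete, here is a minimal sketch of a Hamming-style code over a 7-symbol alphabet. It is an illustration under stated assumptions, not the code any of the linked libraries use: it assumes all seven glyphs can serve as digits 0..6 (so arithmetic is done mod 7), and the names `encode` and `decode` are made up for this example. Six data digits get two check digits, and any single misread digit in the resulting 8-digit word can be corrected.

```python
Q = 7                 # assumed alphabet size; must be prime so mod-Q arithmetic is a field
DATA_LEN = Q - 1      # 6 data digits per codeword
CODE_LEN = Q + 1      # 6 data digits + 2 check digits


def encode(data):
    """Append two check digits so both weighted sums of the codeword are 0 mod Q."""
    if len(data) != DATA_LEN or any(not 0 <= d < Q for d in data):
        raise ValueError("expected %d digits in range 0..%d" % (DATA_LEN, Q - 1))
    c0 = -sum(data) % Q                                      # plain sum check
    c1 = -sum((i + 1) * d for i, d in enumerate(data)) % Q   # position-weighted check
    return list(data) + [c0, c1]


def decode(received):
    """Return the 6 data digits, correcting at most one misread digit."""
    if len(received) != CODE_LEN or any(not 0 <= r < Q for r in received):
        raise ValueError("expected %d digits in range 0..%d" % (CODE_LEN, Q - 1))
    word = list(received)
    # Syndromes: both are 0 for an undamaged codeword.
    s0 = sum(word[:DATA_LEN + 1]) % Q
    s1 = (sum((i + 1) * word[i] for i in range(DATA_LEN)) + word[-1]) % Q
    if s0 == 0 and s1 == 0:
        return word[:DATA_LEN]
    if s0 == 0:
        pos, err = CODE_LEN - 1, s1            # only the second check digit is off
    else:
        err = s0                               # size of the error
        locator = s1 * pow(s0, Q - 2, Q) % Q   # s1 / s0 mod Q (Fermat inverse)
        pos = DATA_LEN if locator == 0 else locator - 1
    word[pos] = (word[pos] - err) % Q
    return word[:DATA_LEN]


codeword = encode([2, 0, 6, 3, 1, 5])
damaged = list(codeword)
damaged[3] = (damaged[3] + 4) % Q              # simulate one OCR misread
print(codeword, decode(damaged))
```

Note the limit of this sketch: two or more misread digits in one word are silently "corrected" to the wrong value. At roughly 70% per-digit accuracy, several errors per word are likely, so a code with far more redundancy (such as Reed-Solomon) would be needed to approach 99%.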

1 Answer

If you have the freedom of actually printing the labels, then there's no real reason to stick with plain ol' numbers. Use QR codes instead. Both the size (information capacity) and the information redundancy are configurable, so you can customize them to fit your specific scenario. Internally, Reed-Solomon error correction is used. There are plenty of libraries for both QR code generation and recognition from a scan.

Further info is available on Wikipedia.
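
To get a feel for how much redundancy a given accuracy target costs, here is a rough sizing sketch. It is not a model of QR codes specifically. It assumes that symbols are misread independently, that the 70% figure from the question is a per-symbol accuracy, and that a Reed-Solomon-style code is available which spends 2 check symbols for every symbol error it can correct. The function names are made up for this example.

```python
from math import comb


def decode_success_probability(n, t, symbol_accuracy):
    """Probability that at most t of n symbols are misread (independent errors)."""
    p_err = 1.0 - symbol_accuracy
    return sum(
        comb(n, k) * p_err ** k * symbol_accuracy ** (n - k)
        for k in range(t + 1)
    )


def min_correctable_errors(data_symbols, symbol_accuracy, target, max_t=200):
    """Smallest t for which a code with 2*t check symbols reaches the target.

    Returns None if no t up to max_t is enough.
    """
    for t in range(max_t + 1):
        n = data_symbols + 2 * t          # codeword length grows with t
        if decode_success_probability(n, t, symbol_accuracy) >= target:
            return t
    return None


# Hypothetical scenario from the question: 7 data symbols, 70% accuracy, 99% target.
t = min_correctable_errors(7, 0.70, 0.99)
print("correctable errors:", t, "check symbols:", 2 * t)
```

Under these assumptions the check symbols end up outnumbering the data symbols several times over, which is why improving the per-symbol recognition rate first is usually cheaper than adding redundancy alone.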

Ondrej Tucny
  • Yes - thanks; I looked into that - but there's a large branding constraint that unavoidably excludes QR code. – metalaureate Feb 04 '15 at 16:11
  • @metalaureate Try to persuade the responsible marketing person QR codes are cute. If that doesn't work, tell the CFO how much failed recognitions will cost. However, if QR codes can't be used, I'd suggest using any other type of bar code: http://en.wikipedia.org/wiki/Barcode – Ondrej Tucny Feb 04 '15 at 16:35
  • Thanks - those positions are all held by yours truly. :) – metalaureate Feb 04 '15 at 16:36
  • Thank you. If they ever see the light of day, you will understand... :) – metalaureate Feb 04 '15 at 16:49