
I have this picture of evenly separated characters:

[image: original picture of the characters]

and using cv2 I inverted it to this:

[image: inverted picture]

and did some contouring around the letters to help the OCR. But when I run image_to_string, the text I'm left with has some lines almost completely missing:

E
IN IA
ES
RVMARABILLARRBAGAZ
EARAVARGQNGUESUSAV
ANNA
AQCOOLLEMREVVCEGAO
ZUVAGOLEBONNABAL XL
REOORMOBILEJAHABAQ
IE II
VRBAONVTVFORÑEBIEP
O00EGREELOVCAVRDLA
A
IN A
EOLREBELAROSBTLVAS
TI
A |

For the output I'm using data = pytesseract.image_to_string(invimage, lang='spa', config='--psm 6'), with Spanish so I can get the "Ñ" character. Any tips on what I'm doing wrong?
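
Roughly, the preprocessing looks like this (a minimal sketch; the file name and threshold value are stand-ins for my actual ones):

import cv2

# Load the scan in grayscale and invert it so the letters are light on dark
img = cv2.imread('letters.png', cv2.IMREAD_GRAYSCALE)  # placeholder path
invimage = cv2.bitwise_not(img)

# Binarize, then box each letter's contour to help the OCR
_, thresh = cv2.threshold(invimage, 127, 255, cv2.THRESH_BINARY)
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    cv2.rectangle(invimage, (x, y), (x + w, y + h), 255, 1)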

  • Did you do any research? I believe Tesseract works best with black text on a white background - have you tried with the unmodified image? There's a comprehensive guide on how to improve the image for best recognition at https://tesseract-ocr.github.io/tessdoc/ImproveQuality; also see other questions here on StackOverflow, like https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy#10034214 – DisappointedByUnaccountableMod Jul 01 '20 at 19:09
  • Yes, I did research, and of course I tried the original image and got worse results. The modified image is pretty clean, but right now I'm trying to get rid of some noise. – Sage Jul 01 '20 at 19:16
  • It’s customary to mention your research in your question and why it didn’t turn up anything useful, so readers won’t waste time repeating it. – DisappointedByUnaccountableMod Jul 01 '20 at 19:48
  • I got much better results than yours using psm 6 on the original image, including the Ñ. I have Tesseract 4.0. – bfris Jul 04 '20 at 03:25

1 Answer


I too am a new contributor, so please forgive any misleading or incorrect parts of this answer. I tried to extract the text from your image and the results were pretty good; here is the output image with bounding boxes:

I used the image_to_data function instead of image_to_string to get the confidence value for each line of text.

Output:

QCCOVARDECRATOBHÍv
CHIBOVZINREVÁVRWTOI # an extra character recognized at the end
VULTOOGCONVOIBORGO
RVMARABILLARRBAGAZ
EARAVARGOQONGUESUSA
V
BSVKOZNAVARAGVÚCTL
AQCOOLLEMREVVCEGAO
ZUVAGOLEBONNABALXL
REODORMOBILEJAHABACQ
EIBBTAODORVICAAOSVR
VRBAONVTVFORÑEBIER
OO0ODEGREELOVCAVRDLA
GBCBTOTBLEOOATXMIAQ
SVALAVANELVOILOVNJ
EOLREBELAROSBTLVAS
VASTORETAVALEARTYW
ADOVNGRAVATAMJREÓ
Í

Still, there are a few incorrect recognitions, like the Ú in the 5th line of the image, and Tesseract even added a few characters.

Here is the Python code:

import cv2
import pytesseract
from pytesseract import Output

# Assumed preprocessing: an Otsu-binarized image for OCR and a copy of the
# original for drawing the boxes ('letters.png' is a placeholder path)
img = cv2.imread('letters.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
img_copy = img.copy()

custom_oem_psm_config = r'--oem 3 --psm 6'
ocr = pytesseract.image_to_data(otsu, output_type=Output.DICT,
                                config=custom_oem_psm_config, lang='spa')

texts = []
for i in range(len(ocr['text'])):
    if int(ocr['conf'][i]) != -1:  # keep only recognized word entries
        # Bounding box of this piece of text, in pixels
        (x, y, w, h) = (ocr['left'][i], ocr['top'][i],
                        ocr['width'][i], ocr['height'][i])
        cv2.rectangle(img_copy, (x, y), (x + w, y + h), (255, 0, 0), 2)
        texts.append(ocr['text'][i])

string = "\n".join(texts)
print("String:", string)

Thank you

Tarun Chakitha
  • Thank you, I already solved it by extracting the boxed letters into individual images and then running image_to_string on each one, but I liked your approach. – Sage Jul 09 '20 at 15:11
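
A minimal sketch of that per-letter idea, for reference (the file name, the thresholding, and the --psm 10 choice are illustrative assumptions, not Sage's actual code):

import cv2
import pytesseract

img = cv2.imread('letters.png', cv2.IMREAD_GRAYSCALE)  # placeholder path
_, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

letters = []
for c in contours:  # note: contours are not returned in reading order
    x, y, w, h = cv2.boundingRect(c)
    crop = img[y:y + h, x:x + w]
    # --psm 10 tells Tesseract to treat the crop as a single character
    letters.append(pytesseract.image_to_string(crop, lang='spa',
                                               config='--psm 10').strip())
print(''.join(letters))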