
I'm currently working on a project where I have to extract data from PDFs. Right now I have a poor-quality document that is really hard for OCR software to read, so I'm trying to improve its quality. I managed to improve the quality of the document overall, but the data itself is still in bad shape. Does anyone have an idea how to fill in the missing pixels?

This is an example:

Christoph Rackwitz
schmanh
    "filling in the pixels"? that level of attention won't solve the problem. OCR tech from the last century will have trouble because it requires glyphs to be separate. state of the art (AI, convolutional networks) has no such requirement and can decode such situations trivially. no, I don't know any free programs, only theory and proof of concepts, and if you're asking for industrial applications, there are companies that will sell you their OCR solutions, that are MADE for this. – Christoph Rackwitz Mar 11 '23 at 17:47
  • Tesseract 4 can use LSTM nets for OCR: https://github.com/tesseract-ocr/tesseract. Furthermore, several other OCR approaches use TensorFlow, PyTorch, Keras or other AI toolchains to train neural networks for character recognition; a web search might lead you in the right direction. Nevertheless, I have not used any of these tools in real-world production applications, and most sources, blog posts etc. I am aware of deal with introductory examples only. – albert Mar 11 '23 at 17:55

2 Answers


I played around a little with your picture. Image quality is very important. Here is my result:

import subprocess
import cv2
import pytesseract

# Image manipulation with ImageMagick
# Command-line options: https://imagemagick.org/script/convert.php
# Path to ImageMagick's convert executable (adjust to your installation)
con_bw = r"D:\Programme\ImageMagic\convert.exe"

in_file = r'D:\Daten\Programmieren\stackoverflow\figure.png'
out_file = r'D:\Daten\Programmieren\stackoverflow\figure_bw.png'

# Downscale, binarize and adjust brightness/contrast; play with these values for better results
process = subprocess.run(
    [con_bw, in_file, "-resize", "30%", "-threshold", "35%",
     "-brightness-contrast", "-20x30", out_file],
    check=True,  # raise an error if ImageMagick fails
)

# Text processing (OCR)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread(out_file)

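# Parameters: see the Tesseract documentation
# --psm 7: treat the image as a single text line; the whitelist restricts output to digits
custom_config = r'--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789'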

tex = pytesseract.image_to_string(img, config=custom_config)
print(tex)

# Save the recognized text
with open("cartootn.txt", 'w') as f:
    f.write(tex)

cv2.imshow('image',img)
cv2.waitKey(8000)
cv2.destroyAllWindows()

Output: 90860568
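To make the effect of the `-threshold 35%` step in the command above more concrete, here is a minimal pure-Python sketch of binarization. It is an illustration only, not ImageMagick's implementation: the function name `threshold` and the sample values are hypothetical, and it assumes 8-bit grayscale values where anything above the cut-off becomes white and everything else becomes black.

```python
def threshold(pixels, percent, max_value=255):
    """Binarize grayscale values: above the cut-off -> white, otherwise black."""
    cutoff = max_value * percent / 100.0
    return [[max_value if v > cutoff else 0 for v in row] for row in pixels]


# Hypothetical 2x4 grayscale patch (0 = black, 255 = white)
patch = [
    [10, 80, 95, 200],
    [89, 90, 120, 255],
]

binary = threshold(patch, 35)  # cut-off is 255 * 0.35 = 89.25
print(binary)
```

Raising the percentage turns more of the faint gray pixels black, which is why it is worth trying several values on a damaged scan.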

Hermann12

The main problem is telling the OCR engine what it needs to do. However, you also need to make sure the image is scaled down to a normal size; try it with this reduced-scale copy. Several studies have shown that too high a resolution is as bad as too low. See the advice from Willus (willus.com):
https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
https://www.willus.com/author/?tesseract_accuracy
and the related question "image processing to improve tesseract OCR accuracy".

Ideally the characters should be about 30 px tall. Here I reduced the image to about double that, so it should ideally be halved again.
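The arithmetic behind that scaling can be sketched as follows. This is only a rough helper under the assumption that you have measured the character height in the current image yourself; the function name `resize_percent` is hypothetical, and the 30 px target is the figure quoted above.

```python
def resize_percent(measured_height_px, target_height_px=30.0):
    """Return the resize percentage that brings characters to the target height."""
    if measured_height_px <= 0:
        raise ValueError("measured character height must be positive")
    return 100.0 * target_height_px / measured_height_px


# Characters currently about 60 px tall (double the target): halve the image.
print(resize_percent(60))
```

The result can be passed as the percentage to a resize step, for example ImageMagick's `-resize` option.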


For a single image area, we can tell Tesseract to decode those pixels as a single line; either page segmentation mode 7 or 13 will do:

Microsoft Windows [Version 10.0.19045.2604]
(c) Microsoft Corporation. All rights reserved.

C:\Apps\PDF\Tesseract>tesseract --help-extra
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.


C:\Apps\PDF\Tesseract>tesseract K7577.png - --psm 7

: 90860568

C:\Apps\PDF\Tesseract>tesseract K7577.png - --psm 13

: 90860568
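The same two calls can be scripted from Python. The sketch below simply wraps the command line shown above with `subprocess`; the helper names `tesseract_cmd` and `ocr_line` are hypothetical, and it assumes `tesseract` is on the PATH and that `K7577.png` is in the working directory (it does nothing otherwise).

```python
import os
import shutil
import subprocess


def tesseract_cmd(image_path, psm, exe="tesseract"):
    """Build the CLI call used above: tesseract <image> - --psm <n>."""
    return [exe, image_path, "-", "--psm", str(psm)]


def ocr_line(image_path, psm=7, exe="tesseract"):
    """Run Tesseract on one image and return the text it prints to stdout."""
    result = subprocess.run(
        tesseract_cmd(image_path, psm, exe),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


image = "K7577.png"
# Only attempt OCR when both the executable and the image are available.
if shutil.which("tesseract") and os.path.exists(image):
    for mode in (7, 13):
        print(mode, ocr_line(image, psm=mode))
```

Comparing the output of both modes on the same crop is a quick way to check whether the segmentation mode matters for a given image.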

With the scale reduced by half again, that area should behave without any problem as part of a PDF page when converting the PDF to an OCR'd PDF.


K J