1

I have a simple program (taken from the docTR documentation) that recognizes text in a PDF file. If the text is perfectly aligned, recognition works without problems, but if the document is rotated to the right or left, recognition starts to fail.

[Images: two scanned pages rotated by arbitrary angles]

The documents I receive are not always rotated by exactly 90, 180 or 270 degrees. Crooked scans can arrive rotated by any angle (as in the pictures above).

I would like your help finding a way to straighten the table / text (or the whole page) in my PDF so that the text is easy to recognize, as in the picture below.

[Image: the same page after straightening]

Perhaps similar solutions already exist, but I have not found them yet. I would be grateful if you could point me to an existing solution or help me write my own.

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

ocr = ocr_predictor(pretrained=True)

doc = DocumentFile.from_pdf("my/path.pdf")
result = ocr(doc)
result.show(doc)
Paul
  • You might be overthinking your problem here. If the documents you are trying to parse always contain a table (as your example does), you could simply use OpenCV to find the table lines and rotate the page so that they become orthogonal. Then try each of the four quarter-turn orientations to see which one works with docTR. – Jason Chia Jul 25 '23 at 12:46
  • @JasonChia Some of your advice is understandable, but not entirely. If you could help with the code, then I would appreciate your advice. – Paul Jul 25 '23 at 13:08
  • Small point that may help you, I know that Tesseract is able to automatically detect rotation and can OCR the text... in this case you only have to convert the pdf into an image. – Oliver M Grech Jul 26 '23 at 07:02
  • @KJ Please tell me why when I write the line of code "from doctr.models import kie_predictor" (as shown in the example), I get the error "ImportError: cannot import name 'kie_predictor' from 'doctr.models'" – Paul Jul 26 '23 at 14:22

5 Answers

2

These are my thoughts on the proposed problem:

  • If you are scanning the tables from paper, then the document, even if it is in PDF format, contains an image.
  • You know that you need to rotate the document for docTR to read it, but from what I read in the docTR repository you could also convert the PDF to an image and have docTR scan it as an image.
  • Why convert the PDF into an image? I think the next two steps might be easier if the file is an image:
  • First you need to know the angle (in degrees or radians, different for each file) by which to rotate the image. For that, you need to scan the image for "long straight lines" (the table borders) and get their angles. You will get many angles and you only need one, so you might have to get a bit creative there (for example, in the last step you could scan the file with docTR at several different angles and measure the success of each result by the amount of data extracted).
  • Once you have your angle (or angles), rotate the image file by the angle you calculated.
  • Last step: use docTR to scan the rotated image.

I know this is not a copy-paste snippet. Hopefully you find an easier way to get there, but this would be my approach if nothing easier worked.
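The "try several angles and keep the best one" idea from the fourth point can be sketched roughly as follows. This is only an outline: `pick_best_angle`, `rotate` and `score` are hypothetical names, not part of docTR. `rotate` stands for whatever image-rotation routine you use, and `score` for any measure of OCR success, such as the number of words docTR returns or whether expected keywords appear.

```python
def pick_best_angle(page, candidate_angles, rotate, score):
    """Try every candidate angle and return (best_angle, best_score).

    rotate(page, angle) -> rotated page      (hypothetical callback)
    score(rotated_page) -> number, higher is better (hypothetical callback)
    """
    best_angle, best_score = None, float("-inf")
    for angle in candidate_angles:
        current = score(rotate(page, angle))
        if current > best_score:
            best_angle, best_score = angle, current
    return best_angle, best_score


# Toy stand-ins so the sketch runs without any OCR library:
# the "page" is just its skew in degrees, rotating adds to it,
# and the score is highest when the remaining skew is zero.
angle, _ = pick_best_angle(
    page=30,
    candidate_angles=range(-45, 46, 5),
    rotate=lambda skew, a: skew + a,
    score=lambda skew: -abs(skew),
)
print(angle)
```

Running OCR once per candidate angle is slow, so in practice it probably makes sense to use a coarse step first and refine only around the best candidate.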

Mous
Martin
  • Thank you for paying attention to my question. Your information is very helpful. I would like to know more about the fifth point in your answer. Yes, I have found that correctly determining the angle of the document is a very difficult task. I left the rest of the problem for later. Perhaps you can help me with code that accurately determines the angle of the document and rotates it back to zero degrees for subsequent text recognition. – Paul Jul 23 '23 at 13:03
  • The answer you linked in the fourth point determines the tilt of the lines without taking the letters into account. For example, if the document is flipped exactly 180 degrees, that code reports a tilt of 0 degrees, since the lines are straight. – Paul Jul 23 '23 at 13:27
  • Hi, I am glad that the comment helped. I know the angles are a real challenge. Thinking about it again, an approach that you shouldn't discard completely is to "brute force" finding the correct angle. Imagine: you run the docTR or any other OCR tool 20 or 50 times per page, for 20 or 50 different angles, angles that you simply guess (0, 5, 10, etc.), and for each result you check if the text generated contains your expected content - you could even check specific keywords, if you know that all your tables contain the word "Results" or "Participants" for example. – Martin Jul 24 '23 at 14:00
  • Since I am not a specialist in this type of problem, it would take me a lot of troubleshooting and learning new packages to put together the solution. It is interesting, but I just don't have enough time. Maybe someone who specializes in this kind of problem is happy to jump in? – Martin Jul 24 '23 at 14:07
  • Of course, maybe someone else can help. My main problem is correctly determining the tilt angle of the text and of the document as a whole (since my PDF may contain both horizontal and vertical text), so that I know how to align the file. Rotating it straight and recognizing the text is no longer a problem. – Paul Jul 24 '23 at 22:03
1

You can use OCRmyPDF, which is a very good OCR tool:

ocrmypdf --rotate-pages input_scanned.pdf output.pdf

The --rotate-pages flag can fix pages that are misrotated (output.pdf is just a placeholder name for the result). I have no experience with doctr.
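If you want to call this from Python, one option is to build the command line and run it with `subprocess`. The sketch below also adds `--deskew`, which, as far as I know, is the OCRmyPDF flag meant for small skew angles, whereas `--rotate-pages` deals with pages that are turned by quarter turns. The function names and file names are made up for illustration, and it assumes the `ocrmypdf` executable is installed and on your PATH.

```python
import subprocess


def build_ocrmypdf_command(src, dst, deskew=True):
    """Return the ocrmypdf command line as a list of arguments."""
    cmd = ["ocrmypdf", "--rotate-pages"]
    if deskew:
        cmd.append("--deskew")  # straighten small skew angles as well
    cmd += [src, dst]           # input PDF first, then output PDF
    return cmd


def straighten_pdf(src, dst):
    """Run ocrmypdf; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_ocrmypdf_command(src, dst), check=True)


# Only prints the command; call straighten_pdf(...) to actually run it.
print(" ".join(build_ocrmypdf_command("input_scanned.pdf", "output.pdf")))
```

The output PDF could then be passed to docTR in place of the original scan.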

JulianWgs
1

You can use a detection model that has been trained on rotated documents and pass the option assume_straight_pages accordingly:

from doctr.models import detection_predictor
predictor = detection_predictor('db_resnet50_rotation', pretrained=True, assume_straight_pages=False, preserve_aspect_ratio=True)

Here is the official documentation.
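Note that a detection predictor only locates text. If you want the same `show` / `export` workflow as in the question, the same options can, as far as I can tell from the docTR documentation, be passed to `ocr_predictor` instead. A minimal sketch, assuming a docTR version whose `ocr_predictor` accepts `assume_straight_pages` and `preserve_aspect_ratio`; the two function names are mine, and I have not verified how well the default architecture copes with large rotation angles.

```python
def build_rotation_aware_ocr():
    """Create a full OCR predictor that does not assume straight pages."""
    # Imported lazily so this file can be loaded without docTR installed.
    from doctr.models import ocr_predictor

    return ocr_predictor(
        pretrained=True,
        assume_straight_pages=False,  # allow rotated text boxes
        preserve_aspect_ratio=True,
    )


def ocr_pdf(path):
    """Run OCR on every page of a PDF and return the recognized text."""
    from doctr.io import DocumentFile

    predictor = build_rotation_aware_ocr()
    pages = DocumentFile.from_pdf(path)
    result = predictor(pages)  # a Document, so result.show(...) and result.export() are available
    return result.render()     # plain-text rendering of the whole document
```

Usage would be something like `print(ocr_pdf("my/path.pdf"))`, with the path from the question.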

matleg
  • Thank you for your attention to my question. When using the line of code you provided, I get the following error (ValueError: unknown architecture 'db_resnet50_rotation'). Perhaps you understand what the problem is? – Paul Jul 17 '23 at 09:39
  • I just re-read the docs and unfortunately it seems this option is valid only for the PyTorch backend and not for TensorFlow (your question is tagged with TF, so I guess you don't have the PyTorch requirements installed...). Sorry, this must be the reason: it tries to download a PyTorch model that of course does not exist for TF. For info, here is the extract from the docs: "NB: for the moment, db_resnet50_rotation is pretrained in Pytorch only and linknet_resnet18_rotation in Tensorflow only." So you can try the model "linknet_resnet18_rotation" instead of "db_resnet50_rotation". – matleg Jul 17 '23 at 09:45
  • Yes, I read that in the documentation too. Is it somehow possible to use "linknet_resnet18_rotation" successfully in my case? – Paul Jul 17 '23 at 09:49
  • I thought it would work just like that: predictor = detection_predictor('linknet_resnet18_rotation', pretrained=True, assume_straight_pages=False, preserve_aspect_ratio=True) If not, sorry, I have no idea... :-( – matleg Jul 17 '23 at 10:01
  • Unfortunately, it doesn't work that way. I get an error (AttributeError: 'list' object has no attribute 'show'). OK, I will keep looking for a solution, thank you. – Paul Jul 17 '23 at 10:13
  • A positive point is that your initial error is solved! The predictor returns a Document object, you have to export the results. It is in the next steps of the [documentation](https://mindee.github.io/doctr/using_doctr/using_models.html#what-should-i-do-with-the-output). Hopefully it will have read something for the rotated document. – matleg Jul 17 '23 at 10:25
1

Before using docTR, you can use Tesseract OCR to rotate the page image according to the orientation of its text. The source code and a detailed walkthrough are available here: https://pyimagesearch.com/2022/01/31/correcting-text-orientation-with-tesseract-and-python/

Your flow might look as follows:

  1. Read pdf and get image.
  2. Send image to tesseract ocr for realignment.
  3. Send the response to DocTR for character recognition.
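Step 2 of this flow could look roughly like the sketch below, which uses `pytesseract.image_to_osd` as in the linked article. As far as I know, Tesseract's orientation detection only reports quarter turns (0, 90, 180 or 270 degrees), so it can fix an upside-down or sideways page but not a skew of a few degrees. The helper names and the `min_confidence` threshold are assumptions of mine, not values from the article.

```python
def coarse_rotation(osd, min_confidence=1.0):
    """Return the quarter-turn correction from an OSD result dict.

    Returns None when Tesseract's own confidence is below min_confidence
    (a made-up threshold that you would need to tune on your documents).
    """
    if float(osd.get("orientation_conf", 0.0)) < min_confidence:
        return None
    return int(osd["rotate"]) % 360


def detect_osd(image):
    """Ask Tesseract for orientation and script information about an image."""
    # Imported lazily: needs the pytesseract package and a Tesseract install.
    import pytesseract
    from pytesseract import Output

    return pytesseract.image_to_osd(image, output_type=Output.DICT)


# Example with a hand-written result dict instead of a real Tesseract call:
print(coarse_rotation({"rotate": 90, "orientation_conf": 2.58}))
```

On a real page you would call `coarse_rotation(detect_osd(image))` and rotate the image by the returned angle before sending it to docTR.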
Harris Minhas
  • Thanks for the advice. Yes, of course, I studied this article and tried to apply it. But the result was incorrect. Perhaps because the examples in this article are only text, and my documents can contain both text and tables. – Paul Jul 26 '23 at 08:43
  • I tried your advice, unfortunately the angle of inclination is determined extremely inaccurately – Paul Jul 30 '23 at 20:46
1

Step 1: Convert pdf to image.

Step 2: Read image with opencv

import math

import cv2
import matplotlib.pyplot as plt
import numpy as np  # the code below refers to numpy as np

img = cv2.imread("test.png", cv2.IMREAD_GRAYSCALE)  # load as single-channel grayscale

Step 3: Preprocess if needed (thresholding, etc.). I skipped this because your example image does not need it.

Step 4: Use Canny edge detection and Hough lines

dst = cv2.Canny(img, 50, 200, None, 3) #see Canny docs
lines = cv2.HoughLines(dst, 1, np.pi / 180, 150, None, 0, 0) # See docs

Step 5: Convert all your lines angles to degrees and find some best fit.

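# lines has shape (N, 1, 2): each entry is [[rho, theta]] with theta in radians
deg_lines = [np.rad2deg(i[0][1]) for i in lines]
# The table lines should be orthogonal on the page, so fold every angle into
# [-45, 45): horizontal and vertical lines then agree on a single skew value.
# (A plain "% 90" maps a small negative skew to ~89 and a small positive one
# to ~1, so the mean lands near 45 whatever the real skew is.)
folded = [((d + 45) % 90) - 45 for d in deg_lines]
candidates_angle = float(np.median(folded))  # the median is robust to stray lines
# candidates_angle now contains the skew of your doc in degrees, modulo 90:
# it cannot tell 0 from 90, 180 or 270. Rotate by it, then try the quarter turns.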



cdst = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)  # colour copy, just to visualize your lines
if lines is not None:
    for rho, theta in lines[:, 0]:
        a = math.cos(theta)
        b = math.sin(theta)
        x0 = a * rho
        y0 = b * rho
        # two points on the line, 1000 px either side of (x0, y0)
        pt1 = (int(x0 + 1000 * (-b)), int(y0 + 1000 * a))
        pt2 = (int(x0 - 1000 * (-b)), int(y0 - 1000 * a))
        cv2.line(cdst, pt1, pt2, (255, 0, 0), 3, cv2.LINE_AA)  # red in RGB order
plt.imshow(cdst)
plt.show()

Step 6: Rotate your image by the angle you found and run your docTR code on the result. See the cv2/PIL documentation for rotating an image; you have done this already, so it should work.
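For the rotation itself, one thing to watch is that rotating inside the original canvas cuts off the corners of the page. A rough sketch that enlarges the canvas first is shown below. `rotated_canvas_size` and `deskew` are names I made up, and `angle_deg` is whatever correction angle you have determined. In `cv2.getRotationMatrix2D` a positive angle means counter-clockwise rotation, so it is worth checking the sign on one sample page.

```python
import math


def rotated_canvas_size(width, height, angle_deg):
    """Size (w, h) of the smallest canvas that holds the rotated image."""
    a = math.radians(angle_deg)
    cos_a, sin_a = abs(math.cos(a)), abs(math.sin(a))
    new_w = int(round(width * cos_a + height * sin_a))
    new_h = int(round(width * sin_a + height * cos_a))
    return new_w, new_h


def deskew(img, angle_deg):
    """Rotate img by angle_deg (counter-clockwise) without cropping corners."""
    import cv2  # imported lazily so the helper above works without OpenCV

    h, w = img.shape[:2]
    new_w, new_h = rotated_canvas_size(w, h, angle_deg)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    # shift the result so the rotated page is centred in the larger canvas
    m[0, 2] += (new_w - w) / 2
    m[1, 2] += (new_h - h) / 2
    # fill the new border with white, like the paper background
    return cv2.warpAffine(img, m, (new_w, new_h), borderValue=(255, 255, 255))


print(rotated_canvas_size(100, 50, 90))
```

Because line angles alone cannot tell an upright page from an upside-down one, you would still try the four quarter-turn orientations on the straightened image, as suggested in the comments under the question.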

Additional docs:

https://docs.opencv.org/3.4/d9/db0/tutorial_hough_lines.html

https://docs.opencv.org/3.4/da/d22/tutorial_py_canny.html

https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html

Please let me know if you have any additional questions.

Jason Chia
  • Thank you very much for your time and for providing the code. I tried your code on my files. Here is the result: I have, for example, 4 files at angles of about 0, about 90, about 180 and about 270 degrees (the actual values may visually differ from these by 1-4 degrees). – Paul Jul 26 '23 at 12:42
  • As I understand it, the candidates_angle variable is the average of all angles, and I have to start from it to know by how many degrees to rotate the file for alignment. But in all four cases candidates_angle shows a value between 41 and 49 degrees. So how can I tell by how many degrees I should rotate the file, given that the output is approximately the same for all four different files? – Paul Jul 26 '23 at 12:43
  • My bounty time is about to expire. I would like to reward you. Please help me to complete this issue. – Paul Jul 26 '23 at 20:03
  • Erm.. Simply add n degrees to 90, I guess. And if that fails, continue with 180, 270, 360. – Jason Chia Jul 27 '23 at 18:04