Data extraction from an image using Python and pytesseract

Question

I have an image that contains a table, and I am trying to extract its data using pytesseract for OCR. When every cell holds a single line of text, I can split the OCR output at '\n\n' and separate the rows. When a cell holds text that wraps onto multiple lines, however, I can no longer tell which text belongs to which row.

Here is what I tried, using the input image verticaltable.png:

from PIL import Image
from pytesseract import image_to_string

im = Image.open(r'verticaltable.png')
text = image_to_string(im)
print(text)

The output of the program:

COLUMN A COLUMN B COLUMN C COLUMN D COLUMN E
1 Prime Minister Rule Country By people 5 years

and president
2 Civil officer Talented people Exam 25-60 years
3 Administrative Documents -

officers maintained
4 Law enforcement | Law Exam 25-60 years
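
To make the problem concrete, here is a small sketch that applies the '\n\n' split to the OCR output above. The variable names are illustrative only; the text is copied from the output shown.

```python
# OCR output copied from above; the names used here are illustrative only.
ocr_text = (
    "COLUMN A COLUMN B COLUMN C COLUMN D COLUMN E\n"
    "1 Prime Minister Rule Country By people 5 years\n"
    "\n"
    "and president\n"
    "2 Civil officer Talented people Exam 25-60 years\n"
    "3 Administrative Documents -\n"
    "\n"
    "officers maintained\n"
    "4 Law enforcement | Law Exam 25-60 years\n"
)

# Split on blank lines, as described in the question.
chunks = ocr_text.strip().split("\n\n")
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```

The table has a header and four rows, but the split yields only three chunks. The wrapped text "and president" ends up in the same chunk as rows 2 and 3, so the chunk boundaries do not match the row boundaries.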

I would like the output grouped by column instead (in other words, how can I extract the data column by column?):

COLUMN A, 1, 2, 3, 4
COLUMN B, Prime Minister and president, Civil officer, Administrative officers, Law enforcement
COLUMN C, Rule Country, Talented people, Documents maintained, Law
COLUMN D, By people, Exam, , Exam
COLUMN E, 5 years, 25-60 years, -, 25-60 years

Note: I have already gone through this answer and tried it, but it still does not work, so please do not mark this question as a duplicate of that answer.

Here's an alternate way to look at this problem: instead of one image, you have 25 small images. So you might approach this by: (1) finding lines with something like the Hough transform, (2) [detecting intersections of those lines](https://stackoverflow.com/questions/46565975/find-intersection-point-of-two-lines-drawn-using-houghlines-opencv), then (3) applying OCR on the 25 sub images. — Alexander L. Hayes, Dec 03 '22 at 18:43
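
The three steps in the comment above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a solution verified against the image in the question. It assumes the table has ruled grid lines that are close to axis-aligned, and it uses OpenCV's probabilistic Hough transform (`cv2.HoughLinesP`). Instead of computing intersections explicitly, it relies on the fact that in an axis-aligned grid every pairing of a vertical line's x with a horizontal line's y is an intersection. The function names, thresholds, tolerance, and padding are all hypothetical values to tune.

```python
def merge_close(values, tol=10):
    """Collapse coordinates lying within `tol` pixels of each other into one."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) // len(g) for g in groups]


def find_grid_lines(path, tol=10):
    """Step 1: return (xs, ys), the positions of vertical and horizontal rules."""
    # Imported lazily so the geometry helpers work without OpenCV installed.
    import cv2
    import numpy as np

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 50, 150)
    segments = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                               minLineLength=min(gray.shape) // 4,
                               maxLineGap=5)
    xs, ys = [], []
    if segments is not None:
        for x1, y1, x2, y2 in segments[:, 0]:
            if abs(x1 - x2) <= 2:        # near-vertical segment
                xs.append(int(x1))
            elif abs(y1 - y2) <= 2:      # near-horizontal segment
                ys.append(int(y1))
    return merge_close(xs, tol), merge_close(ys, tol)


def cell_boxes(xs, ys, pad=3):
    """Step 2: yield (row, col, box); box is (left, upper, right, lower)."""
    for r, (top, bottom) in enumerate(zip(ys, ys[1:])):
        for c, (left, right) in enumerate(zip(xs, xs[1:])):
            # Shrink each box by `pad` so the rules themselves are not OCR'd.
            yield r, c, (left + pad, top + pad, right - pad, bottom - pad)


def table_by_columns(path):
    """Step 3: OCR every cell and return one list of cell texts per column."""
    from PIL import Image
    from pytesseract import image_to_string

    xs, ys = find_grid_lines(path)
    im = Image.open(path)
    columns = [[] for _ in range(len(xs) - 1)]
    for _, c, box in cell_boxes(xs, ys):
        # --psm 6 asks Tesseract to treat the crop as one uniform text block.
        text = image_to_string(im.crop(box), config="--psm 6")
        # Join wrapped lines inside a cell into a single string.
        columns[c].append(" ".join(text.split()))
    return columns
```

If the outer border and all inner rules are detected, `table_by_columns('verticaltable.png')` should return five lists, each starting with its column header, and `', '.join(column)` would then give one output line per column. Because each cell is read on its own, text that wraps inside a cell stays with that cell.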
