I have an image that contains a table, and I am trying to extract its data using pytesseract for OCR. When every cell holds a single line of text, I can split the OCR output at '\n\n' and separate the rows. When a cell holds multiple lines of text, however, I cannot tell which text belongs to which row.
Here is what I tried, with the table image (verticaltable.png) as input:
from PIL import Image
from pytesseract import image_to_string
im = Image.open(r'verticaltable.png')
text = image_to_string(im)
print(text)
The output of the program:
COLUMN A COLUMN B COLUMN C COLUMN D COLUMN E
1 Prime Minister Rule Country By people 5 years
and president
2 Civil officer Talented people Exam 25-60 years
3 Administrative Documents -
officers maintained
4 Law enforcement | Law Exam 25-60 years
I would like the output grouped by column instead. In other words, how can I extract the data column by column, like this?
COLUMN A, 1, 2, 3, 4
COLUMN B, Prime Minister and president, Civil officer, Administrative officers, Law enforcement
COLUMN C, Rule Country, Talented people, Documents maintained, Law
COLUMN D, By people, Exam, , Exam
COLUMN E, 5 years, 25-60 years, -, 25-60 years
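To make the goal concrete, here is a rough sketch of one possible direction, not a working solution. Instead of the flat string from `image_to_string`, it uses the word-level bounding boxes that `pytesseract.image_to_data` returns and assigns each word to a cell by position. The function names `words_to_columns` and `ocr_words`, and the `column_lefts` and `row_tops` inputs, are hypothetical: the sketch assumes the x coordinate where each column starts and the y coordinate where each row starts are already known (for example, read off the header words and the numbers in COLUMN A), and it assumes Tesseract returns the words of a cell in reading order.

```python
from bisect import bisect_right


def words_to_columns(data, column_lefts, row_tops, noise=("|",)):
    """Group OCR words into table cells by position, one list per column.

    data         : dict shaped like image_to_data(..., output_type=Output.DICT)
    column_lefts : sorted x coordinates where each column starts (assumed known)
    row_tops     : sorted y coordinates where each row starts, header row
                   first (assumed known)
    noise        : tokens to drop, e.g. table borders misread as "|"
    """
    n_cols, n_rows = len(column_lefts), len(row_tops)
    cells = [[[] for _ in range(n_rows)] for _ in range(n_cols)]
    fields = zip(data["text"], data["left"], data["top"],
                 data["width"], data["height"])
    for text, left, top, width, height in fields:
        text = text.strip()
        if not text or text in noise:
            continue  # skip empty layout entries and border artifacts
        # Locate the cell from the centre of the word's bounding box.
        col = bisect_right(column_lefts, left + width / 2) - 1
        row = bisect_right(row_tops, top + height / 2) - 1
        if col < 0 or row < 0:
            continue  # word lies outside the table area
        # Words are appended in the order Tesseract returned them.
        cells[col][row].append(text)
    return [[" ".join(words) for words in column] for column in cells]


def ocr_words(path):
    """Run Tesseract and return word-level boxes as a dict of lists."""
    # Imported lazily so the grouping function works without OCR installed.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT)
```

With the boundaries in hand, something like `for column in words_to_columns(ocr_words('verticaltable.png'), column_lefts, row_tops): print(", ".join(column))` would print one line per column. A cell with no words comes out as an empty string, which matches the empty entry in COLUMN D above. What I do not know is how to obtain the column and row boundaries reliably.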
Note: I have already gone through this answer and tried it, but it still does not work, so kindly do not mark this question as a duplicate of it.