
I have tabular data in an image (see pic1).

[Image: Sample]

The tabular data needs to be extracted and saved in CSV format, with the same layout as the table.

I used pytesseract to read the data from the image, and it partially worked. Code:

import csv
import re

import pytesseract
from PIL import Image

# Path to the Tesseract executable on this machine
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'

# OCR the whole image into one plain-text string
result = pytesseract.image_to_string(Image.open(r'D:\Sample.jpg'), lang="eng")

print(result)

# Note: writerow() iterates over the string, so every character lands in its own cell
with open(r'D:\people.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(result)

# Replace everything except letters, digits, newlines and dots with spaces
with open(r'D:\people.csv') as infile:
    string = infile.read()
new_str = re.sub(r'[^a-zA-Z0-9\n.]', ' ', string)
with open(r'D:\people.csv', 'w') as outfile:
    outfile.write(new_str)

Output:

[Image: capture]

The output file opens as plain text, and I cannot get a proper CSV layout (i.e. one that matches the table in the image).
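
One likely reason the file does not come out as a table: `csv.writer.writerow` expects a sequence of cells, and a Python string is a sequence of characters, so passing the whole OCR string produces one cell per character. Below is a minimal sketch of splitting the text into rows first, assuming that columns in the OCR output are separated by two or more spaces. Tesseract often collapses spacing unless it is run with the `preserve_interword_spaces=1` config variable, so this assumption may not hold for every image. The names `ocr_text_to_rows` and `rows_to_csv` and the sample text are hypothetical, not taken from the post.

```python
import csv
import io
import re


def ocr_text_to_rows(text, min_gap=2):
    """Split OCR text into rows of cells.

    Assumption: columns are separated by at least `min_gap` whitespace
    characters, while words inside one cell are separated by a single space.
    """
    splitter = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(splitter.split(line))
    return rows


def rows_to_csv(rows):
    """Serialise a list of rows to CSV text (one list of cells per row)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample standing in for the string returned by image_to_string
sample = "Name   Age   City\n\nAlice   30   Paris\nBob   25   New York\n"
rows = ocr_text_to_rows(sample)
print(rows_to_csv(rows))
```

In the script above, `rows` would be computed from `result` and written with `writer.writerows(rows)` to a file opened with `newline=''`.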

Any help would be appreciated. TIA

RSK Rao
  • 2
    This is a well-known, but not yet generally solved, problem from the field of data mining called Table Detection. I am guessing there will most certainly not be a ready-made solution. – mfit Sep 15 '18 at 10:32
  • 1
    Oh, OK. When I try to save it directly as a CSV file, the data is populated as 'R','E','P','O',... and each cell gets a single character. – RSK Rao Sep 15 '18 at 10:36
  • 1
    I'm facing the same problem. Is there a solution for this? – sureshvignesh May 31 '19 at 07:25
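
Where the spacing in the plain OCR text is unreliable, word positions can be used instead. `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns parallel lists that include 'text', 'left', 'width', 'block_num', 'par_num' and 'line_num'. The sketch below is a heuristic, not a general table-detection solution: it groups words by Tesseract's own line numbering and starts a new cell whenever the horizontal gap to the previous word exceeds `col_gap` pixels. The function name `words_to_rows`, the `col_gap` threshold and the sample dictionary are hypothetical; the sample only mimics the shape of the real output so that the code runs without Tesseract installed. Tesseract may also place cells of the same table row in different blocks, in which case grouping by the 'top' coordinate would be needed instead.

```python
from collections import defaultdict


def words_to_rows(data, col_gap=40):
    """Group OCR word boxes into rows, then merge nearby words into cells.

    `data` is assumed to have the shape of pytesseract's image_to_data
    output in DICT form. `col_gap` is a hypothetical pixel threshold that
    would need tuning for each image.
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # empty entries describe blocks and lines, not words
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], data["width"][i], text))

    rows = []
    for key in sorted(lines):
        words = sorted(lines[key])  # left-to-right order within the line
        cells = [words[0][2]]
        prev_right = words[0][0] + words[0][1]
        for left, width, text in words[1:]:
            if left - prev_right > col_gap:
                cells.append(text)        # wide gap: start a new cell
            else:
                cells[-1] += " " + text   # narrow gap: same cell
            prev_right = left + width
        rows.append(cells)
    return rows


# Hypothetical data mimicking image_to_data output for a two-row table
sample = {
    "text": ["", "Name", "Age", "New", "York", "30"],
    "left": [0, 10, 200, 10, 60, 200],
    "width": [300, 50, 40, 40, 45, 25],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 2, 2, 2],
}
table = words_to_rows(sample)
print(table)
```

The resulting list of rows can be passed to `csv.writer.writerows` in the same way as any other list of cell lists.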
