Extract a table from a scanned PDF into CSV using pytesseract in Python

Question

I have several types of invoice files, and I want to find the table in each one. I can already convert a scanned PDF to an image using the 'pdf2jpg' method. Now I need to extract the table from each invoice and write it to a CSV file using OCR with pytesseract. Please help.

You can't get that in pytesseract. Pytesseract is supposed to just extract all the text from a pdf file. [This](https://stackoverflow.com/questions/50829874/how-to-find-table-like-structure-in-image) should be helpful for you. — Siddharth Prajosh, Jan 14 '20 at 10:40
Depending on how the pdf was made, you may be better off using pdf2txt directly, rather than converting to jpg and then trying ocr. If the pdf was scanned from a paper invoice, that won't help, but if it was generated directly you can get the text without having to try to use ocr. — Brian Minton, Jan 14 '20 at 13:22
@Siddharth Prajosh I have already tried that shared link but didn't get relevant result. Please find below code that i am using ----------------------------------------------- — Himanshu, Jan 16 '20 at 10:38
@Siddharth i tried to use the code on shared link but now i am getting error as "AttributeError: 'JpegImageFile' object has no attribute 'make_blob'". Please help — Himanshu, Jan 22 '20 at 09:41
Does this answer your question? [How to extract table as text from the PDF using Python?](https://stackoverflow.com/questions/47533875/how-to-extract-table-as-text-from-the-pdf-using-python) — Eric Ihli, Apr 27 '20 at 02:41

score 1 · Answer 1 · edited Jan 19 '20 at 10:37

Perhaps this code will help you:

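import pytesseract

# Point pytesseract at the Tesseract executable (Windows install path; adjust for your machine).
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

# OCR the page image; image_to_string accepts a file path and returns plain text.
text = pytesseract.image_to_string('c:\\screenshot\\test.png')

# Save the raw OCR text. This is unstructured text, not parsed table cells.
with open('c:\\screenshot\\csvfile_1.csv', 'w', encoding='utf-8') as f:
    f.write(text)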
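
The code above saves the OCR text as one blob, so the file will not have real rows and columns. One possible way to get closer to a table is to use the word positions that `pytesseract.image_to_data` returns. The sketch below is only a heuristic and not something from the original answer: the function names `words_to_rows` and `image_to_csv` and the pixel thresholds `row_tol` and `col_gap` are assumptions made up for illustration, and they would need tuning for each invoice layout.

```python
import csv


def words_to_rows(data, row_tol=10, col_gap=40):
    """Group OCR words into rows of cells using their pixel positions.

    `data` is a dict shaped like the result of
    pytesseract.image_to_data(..., output_type=Output.DICT).
    Words whose `top` is within `row_tol` pixels of the first word in a row
    share that row; a horizontal gap wider than `col_gap` starts a new cell.
    Both thresholds are illustrative guesses.
    """
    words = [
        (data["top"][i], data["left"][i], data["width"][i], text.strip())
        for i, text in enumerate(data["text"])
        if text.strip()  # skip the empty entries emitted for blocks and lines
    ]
    words.sort(key=lambda w: (w[0], w[1]))

    # Cluster words into visual rows by their vertical position.
    grouped = []
    for word in words:
        if grouped and abs(word[0] - grouped[-1][0][0]) <= row_tol:
            grouped[-1].append(word)
        else:
            grouped.append([word])

    # Within each row, merge neighbouring words into cells from left to right.
    rows = []
    for group in grouped:
        cells, prev_right = [], None
        for _top, left, width, text in sorted(group, key=lambda w: w[1]):
            if prev_right is not None and left - prev_right <= col_gap:
                cells[-1] += " " + text
            else:
                cells.append(text)
            prev_right = left + width
        rows.append(cells)
    return rows


def image_to_csv(image_path, csv_path):
    """OCR one page image and write the grouped rows to a CSV file."""
    import pytesseract  # imported here so words_to_rows works without Tesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image_path, output_type=Output.DICT)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(words_to_rows(data))
```

Calling `image_to_csv('c:\\screenshot\\test.png', 'c:\\screenshot\\csvfile_1.csv')` would then write one CSV line per detected row. Skewed scans, wrapped cells, or tightly packed columns will likely break this grouping, so detecting the table lines in the image first (as the link in the comments suggests) may work better for some invoices.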

answered Jan 14 '20 at 10:21 by Hietsh Kumar · edited Jan 19 '20 at 10:37 by halfer
