I have different types of invoice files, and I want to find the table in each one. I can already convert a scanned PDF to images using the 'pdf2jpg' method. Now I need to extract the table from each invoice and write it to a CSV file using OCR with pytesseract. Please help.
You can't get that in pytesseract. Pytesseract is supposed to just extract all the text from a PDF file. [This](https://stackoverflow.com/questions/50829874/how-to-find-table-like-structure-in-image) should be helpful for you. – Siddharth Prajosh Jan 14 '20 at 10:40
-
Depending on how the PDF was made, you may be better off using pdf2txt directly, rather than converting to JPG and then trying OCR. If the PDF was scanned from a paper invoice, that won't help, but if it was generated directly you can get the text without having to use OCR. – Brian Minton Jan 14 '20 at 13:22
-
@Siddharth Prajosh I have already tried the shared link but didn't get a relevant result. Please find below the code that I am using. – Himanshu Jan 16 '20 at 10:38
-
@Siddharth I tried the code from the shared link, but now I am getting the error "AttributeError: 'JpegImageFile' object has no attribute 'make_blob'". Please help. – Himanshu Jan 22 '20 at 09:41
-
Does this answer your question? [How to extract table as text from the PDF using Python?](https://stackoverflow.com/questions/47533875/how-to-extract-table-as-text-from-the-pdf-using-python) – Eric Ihli Apr 27 '20 at 02:41
1 Answer
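Perhaps this code will help you:

    import pytesseract

    # Path to the Tesseract executable (adjust for your installation)
    pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

    # OCR the image; this returns plain text, not a structured table
    text = pytesseract.image_to_string('c:\\screenshot\\test.png')

    # Write the recognized text to the output file
    with open('c:\\screenshot\\csvfile_1.csv', 'w', encoding='utf-8') as f:
        f.write(text)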
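Note that `image_to_string` returns one block of plain text, so the file written above will not have real columns. If you need one CSV row per table line, a rough sketch is to use `pytesseract.image_to_data`, which also returns the position of each word, and split each text line into cells wherever there is a wide horizontal gap. The helper names `words_to_rows` and `image_table_to_csv`, the `gap_px=40` threshold, and the `--psm 6` page segmentation setting are my own assumptions for illustration, not something from the question, and they will likely need tuning per invoice layout.

```python
import csv
from collections import defaultdict


def words_to_rows(data, min_conf=0.0, gap_px=40):
    """Group Tesseract word boxes into rows of cells (hypothetical helper).

    `data` is the dict returned by pytesseract.image_to_data(...,
    output_type=pytesseract.Output.DICT). Words on the same text line that
    are separated by more than `gap_px` pixels start a new cell.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # Skip empty entries (layout boxes) and low-confidence words
        if not word.strip() or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], data["width"][i], word.strip()))

    rows = []
    for _, words in sorted(lines.items()):
        cells, prev_right = [], None
        for left, width, word in sorted(words):  # left-to-right order
            if prev_right is not None and left - prev_right <= gap_px:
                cells[-1] += " " + word  # small gap: same cell
            else:
                cells.append(word)  # wide gap: next column
            prev_right = left + width
        rows.append(cells)
    return rows


def image_table_to_csv(image_path, csv_path):
    """OCR one image and write the detected rows to a CSV file."""
    # Imported here so words_to_rows can be tried without Tesseract installed
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config="--psm 6",  # assumption: treat the page as one uniform text block
        output_type=pytesseract.Output.DICT,
    )
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(words_to_rows(data))
```

You would call it as `image_table_to_csv('c:\\screenshot\\test.png', 'c:\\screenshot\\csvfile_1.csv')`. This only works reasonably when the table columns are clearly separated by whitespace; for tables with ruled lines or skewed scans, detecting the table region first (as in the linked question) is probably still needed.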

halfer

Hietsh Kumar