I have several different types of invoice files, and I want to find the table in each one. I can already convert the scanned PDFs to images using the 'pdf2jpg' method. Now I need to extract the table from each invoice and write it to a CSV file using OCR with pytesseract. Please help.

Vadim Kotov
Himanshu
  • You can't get that in pytesseract. Pytesseract is supposed to just extract all the text from a pdf file. [This](https://stackoverflow.com/questions/50829874/how-to-find-table-like-structure-in-image) should be helpful for you. – Siddharth Prajosh Jan 14 '20 at 10:40
  • Depending on how the pdf was made, you may be better off using pdf2txt directly, rather than converting to jpg and then trying ocr. If the pdf was scanned from a paper invoice, that won't help, but if it was generated directly you can get the text without having to try to use ocr. – Brian Minton Jan 14 '20 at 13:22
  • @Siddharth Prajosh I have already tried the shared link but didn't get a relevant result. Please find below the code that I am using ----------------------------------------------- – Himanshu Jan 16 '20 at 10:38
  • @Siddharth I tried the code from the shared link, but now I am getting the error "AttributeError: 'JpegImageFile' object has no attribute 'make_blob'". Please help. – Himanshu Jan 22 '20 at 09:41
  • Does this answer your question? [How to extract table as text from the PDF using Python?](https://stackoverflow.com/questions/47533875/how-to-extract-table-as-text-from-the-pdf-using-python) – Eric Ihli Apr 27 '20 at 02:41

1 Answer

Perhaps this code will help you:

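    import pytesseract

    # Point pytesseract at the Tesseract executable (this path is machine-specific).
    pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

    # OCR the whole image into one plain-text string; no table structure is detected.
    text = pytesseract.image_to_string('c:\\screenshot\\test.png')

    # Save the raw text. The file has a .csv name, but the text is not split into columns.
    with open('c:\\screenshot\\csvfile_1.csv', 'w') as f:
        f.write(text)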

Sample Image
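
If the raw text is not enough and you need actual rows and columns, one possible approach is to ask Tesseract for word positions with `pytesseract.image_to_data` and group the words yourself. The sketch below is only an illustration under some assumptions: the helper names (`rows_from_ocr_data`, `image_to_csv`) and the pixel thresholds `row_tol` and `col_gap` are made up for this example, and the gap-based column splitting is a rough heuristic that will likely need tuning per invoice layout. It does not detect table borders.

```python
import csv


def rows_from_ocr_data(data, row_tol=10, col_gap=40):
    """Group OCR words into rows (similar 'top') and cells (large horizontal gaps).

    `data` is the dict returned by pytesseract.image_to_data(..., output_type=Output.DICT).
    `row_tol` and `col_gap` are hypothetical pixel thresholds; tune them per layout.
    """
    words = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        # Skip empty entries and non-word boxes (Tesseract reports conf -1 for those).
        if not text or float(data["conf"][i]) < 0:
            continue
        words.append((data["top"][i], data["left"][i], data["width"][i], text))
    words.sort()  # top-to-bottom, then left-to-right

    # Cluster words whose 'top' is within row_tol of the first word of the current row.
    rows = []
    for top, left, width, text in words:
        if rows and abs(top - rows[-1]["top"]) <= row_tol:
            rows[-1]["words"].append((left, width, text))
        else:
            rows.append({"top": top, "words": [(left, width, text)]})

    # Within each row, start a new cell whenever the gap to the previous word is large.
    table = []
    for row in rows:
        cells, prev_right = [], None
        for left, width, text in sorted(row["words"]):
            if prev_right is not None and left - prev_right <= col_gap:
                cells[-1] += " " + text
            else:
                cells.append(text)
            prev_right = left + width
        table.append(cells)
    return table


def image_to_csv(image_path, csv_path):
    """OCR one image and write the grouped rows to a CSV file."""
    import pytesseract  # imported here so the grouping helper works without Tesseract

    data = pytesseract.image_to_data(image_path, output_type=pytesseract.Output.DICT)
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(rows_from_ocr_data(data))
```

You would call it as `image_to_csv('c:\\screenshot\\test.png', 'c:\\screenshot\\csvfile_1.csv')` after setting `tesseract_cmd` as above. For scanned invoices with ruled tables, detecting the cell borders first (as in the question linked in the comments) may give cleaner results than gap thresholds.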

halfer
Hietsh Kumar