5

I'm trying to extract data from pdf/image invoices using computer vision.For that i used ocr based pytesseract. this is sample invoice enter image description here you can find code for same below

import pytesseract


img = Image.open("invoice-sample.jpg")

text = pytesseract.image_to_string(img)

print(text)

by using pytesseract i got below output

http://mrsinvoice.com

 

’ Invoice

Your Company LLC Address 123, State, My Country P 111-222-333, F 111-222-334


BILLTO:

fofin Oe Invoice # 00001

Alpha Bravo Road 33 Invoice Date 32/12/2001

P: 111-292-333, F: 111-222-334

client@example.net Nomecof Reps Bob
Contact Phone 101-102-103

SHIPPING TO:

eine ce Payment Terms ash on Delivery

Office Road 38
P: 111-333-222, F: 122-222-334 Amount Due: $4,170
office@example.net

NO PRODUCTS / SERVICE QUANTITY / RATE / UNIT AMOUNT
HOURS: PRICE

1 tye 2 $20 $40

2__| Steering Wheel 5 $10 $50

3 | Engine oil 10 $15 $150

4 | Brake Pad 24 $1000 $2,400

Subtotal $275

Tax (10%) $27.5

Grand Total $202.5

‘THANK YOU FOR YOUR BUSINESS

but problem is i want to extract text and segregate it into different parts like Vendor name, Invoice number, item name and item quantity. expected output

{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}

I also tried invoice2data python library but again it has many limitation. I also tried regex and opencv's canny edge detection for detecting text boxes separately but failed to achieve the expected outcome

could you guys please help me

Christoph Rackwitz
  • 11,317
  • 4
  • 27
  • 36
Mayur Satav
  • 985
  • 2
  • 12
  • 32
  • 2
    Is the image format a standard? Would the tables, Name, etc. appear at the same locations as in the attached sample ? – ZdaR Apr 17 '20 at 07:28
  • No there is no standard format for that because different product sellers has their different invoice formats.but if what you said is possible ,then it will also work. – Mayur Satav Apr 17 '20 at 13:35
  • I think instead of `tesseract`, you shall use Google Vision API, which will return the bounding box of various paragraphs along with the OCR text, and then you can build some regex kind of stuff to detect the address block or name block or tabular data, etc. – ZdaR Apr 17 '20 at 13:42
  • 2
    How would regex be useful if there is no format? Is it not hardcoding to use regex? Can you suggest something related to ML? – paradocslover Apr 17 '20 at 14:52
  • @MayurSatav - Having the same issue as you have faced, So have you received any Solution for it , Please suggest as checked but not getting relevant solutions. – Manz Oct 19 '20 at 08:01

2 Answers2

4

You must do more processing, especially because BILL TO and SHIPPING TO are not aligned with the invoice table. But you can use following code as a base.

import cv2
import pytesseract
from pytesseract import Output
import pandas as pd

img = cv2.imread("aF0Dc.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

custom_config = r'-l eng --oem 1 --psm 6 '
d = pytesseract.image_to_data(thresh, config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

df1 = df[(df.conf != '-1') & (df.text != ' ') & (df.text != '')]
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num'] == block]
    sel = curr[curr.text.str.len() > 3]
    # sel = curr
    char_w = (sel.width / sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left'] / char_w > prev_left + 1:
            added = int((ln['left']) / char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)

The result

                                                                                             bhttps//mrsinvoice.com 
                                                                                                  Lp 
                  I                                                                              | 
        Your Company LLC Address 123, State, My Country P 111-222-333, F 111-222-334 
        BILL TO: 
        P: 111-222-333, F: 111-222-334                          m                              . 
        dlent@ccomplent 
                                                          Contact Phone                  101-102-103 
        john Doe office                                   ayment  Terms                  ash on Delivery 
        Office Road 38 
        P: 111-833-222, F: 122-222-334                            Amount     Due:   $4,170 
        office@example.net 
          NO  PRODUCTS  / SERVICE                                   QUANTITY  /   RATE / UNIT       AMOUNT 
                                                                        HOURS,         PRICE 
         1  | tyre                                                           2           $20             $40 
         2  | Steering Wheet                                                 5          $10              $50 
         3  | Engine ol                                                     40          $15             $150 
         4  | Brake Pad                                                     2a         $1000          $2,400 
                                                                                     Subtotal           $275 
                                                                                    Tax (10%)          $275 
                                                                                  Grand Total         $302.5 
                                            ‘THANK YOU  FOR YOUR BUSINESS 
us2018
  • 603
  • 6
  • 11
0

You can use my code found here : https://github.com/josephmulindwa/Table-OCR-Extractor/blob/main/Example.pdf.

There is an attached PDF that will show you how to extract your table with ease.

Anil Myne
  • 79
  • 3