13

I am looking for a solution to extract data from my invoices to send a summary to my accountant.

Some companies provide such services for around 20€ a month, and invoices are usually recognised very well. But the services I tried either don't extract all the data I want or are missing functionality such as an Excel export to send the data to my accountant. And paying 20€ a month and managing yet another service for 5 invoices per month hasn't appealed to me so far.

I did a little research and found this Stack Overflow question: Can anyone recommend OCR software to process invoices?

It's a bit outdated, and I hope to find some more up-to-date recommendations. I tried the Ephesoft community edition, and it looked very promising at first. But the software has a learning step and a review step, and the data from the review step doesn't seem to be fed back into the learning step. Plus it feels more cumbersome than just doing it by hand. I assume it's made for big businesses.

I am looking for simple data extraction software that learns with each step I show it.

I also had a look at Apache Tika, but it doesn't seem ready to use with a simple web interface.

  1. Do you have any recommendations for paid OCR services? They should be flexible enough to extract Total VAT amount / VAT % / Total Amount / Total Amount Currency / VAT Currency / which account it was paid with / Company name, with an export to Excel.

  2. Do you have any recommendations for open source software?

  3. Do you have any general advice on how you handle your few (fewer than 50 a year) invoices?

desertnaut
Toby
  • Btw, what are the steps / best practices if someone wants to build something like Veryfi or Nanonets from scratch without using any 3rd party APIs? – LordDraagon Dec 14 '20 at 06:38

5 Answers

9

Apart from raw OCR with regexes on top (which may work fine for some very limited use cases), there are several other options that offer API access. These you can start using right away, without any demo or sales process:

  • TagGun - specializes in receipts, can extract line items too, free for 50 receipts monthly
  • Elis - specializes in invoices, supports a wide variety of templates automatically (a pre-trained machine learning model), free for under 300 invoices monthly

If you are willing to go through the sales process (and they actually seem to be real and live):

  • LucidTech and Itemize (not sure what their accuracy is or which fields they extract, as their API details are non-public)
  • FlexiCapture Engine - based on templates, if you are willing to define one for each specific invoice format

(disclaimer: I'm affiliated with Rossum, the vendor of Elis. Feel free to suggest edits adding other APIs!)
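To illustrate the "raw OCR plus regexes" option mentioned at the top of this answer, here is a minimal sketch. It assumes the OCR step has already produced plain text; the sample text, the field names, and the patterns are all made up for illustration and are tuned to this one layout, which is exactly why the approach only suits very limited use cases.

```python
import re
from decimal import Decimal

# Hypothetical OCR output; real OCR text is usually noisier than this.
SAMPLE_TEXT = """
ACME GmbH
Invoice No: 2020-0042
Net amount: 100,00 EUR
VAT 19%: 19,00 EUR
Total: 119,00 EUR
"""

# Patterns are assumptions tuned to the sample layout above.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s+No[.:]?\s*(\S+)", re.I),
    "vat_percent": re.compile(r"VAT\s+(\d{1,2}(?:[.,]\d+)?)\s*%", re.I),
    "vat_amount": re.compile(r"VAT[^\n]*?%\s*:?\s*(\d+[.,]\d{2})", re.I),
    "total_amount": re.compile(r"^Total\s*:?\s*(\d+[.,]\d{2})", re.I | re.M),
    "currency": re.compile(r"^Total[^\n]*?\b([A-Z]{3})\b", re.M),
}

NUMERIC_FIELDS = ("vat_percent", "vat_amount", "total_amount")


def to_decimal(raw):
    """Convert '119,00' or '119.00' to Decimal; assumes no thousands separator."""
    return Decimal(raw.replace(",", "."))


def extract_fields(text):
    """Return a dict of field name -> value, with None for fields not found."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    for name in NUMERIC_FIELDS:
        if fields[name] is not None:
            fields[name] = to_decimal(fields[name])
    return fields


if __name__ == "__main__":
    print(extract_fields(SAMPLE_TEXT))
```

Every new supplier layout needs its own patterns, so this gets tedious quickly once you have more than a handful of invoice formats.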

Petr Baudis
3

Sypht provide an API to do this: http://www.sypht.com.

Python client: https://github.com/sypht-team/sypht-python-client

Step 1

pip install sypht

Step 2

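from sypht.client import SyphtClient

# Credentials come from your Sypht account
sc = SyphtClient('<client_id>', '<client_secret>')

# Upload the invoice and request the "document" and "invoice" fieldsets
with open('invoice.png', 'rb') as f:
    fid = sc.upload(f, fieldsets=["document", "invoice"])

# Fetch and print the extraction results for the uploaded file
print(sc.fetch_results(fid))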

Disclaimer: I am affiliated with the vendor

2

Check out Veryfi. It extracts 50+ fields from receipts and invoices, including line items, in 3-5 seconds.

It’s ready to use out of the box (i.e. no need to train it) with high accuracy results and supports over 30 languages/regions.

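pip install veryfi

from veryfi import Client

# Placeholders; replace with the credentials from your Veryfi account
client_id = '<client_id>'
client_secret = '<client_secret>'
username = '<username>'
api_key = '<api_key>'

veryfi_client = Client(client_id, client_secret, username, api_key)

categories = ['Grocery', 'Utilities', 'Travel']  # list of your categories

file_path = '/tmp/invoice.jpg'

response = veryfi_client.process_document(file_path, categories=categories)

print(response)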

Here is a detailed overview of how to use it: https://www.veryfi.com/engineering/invoice-data-capture-api/
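Since the question asks for an Excel export, here is a rough sketch of how extraction results could be written to a CSV file that Excel opens directly. It works on plain dicts; the column names are assumptions taken from the fields listed in the question, not the schema of any particular API, so you would have to map the actual response keys onto them yourself.

```python
import csv
import io

# Column order for the accountant's summary. These names are assumptions
# derived from the fields listed in the question, not any vendor's schema.
COLUMNS = [
    "company_name",
    "total_amount",
    "total_currency",
    "vat_amount",
    "vat_percent",
    "vat_currency",
    "paid_with",
]


def to_row(extracted):
    """Pick the summary columns from one extraction result (a plain dict)."""
    return {column: extracted.get(column, "") for column in COLUMNS}


def write_summary(results, out):
    """Write a header line plus one CSV row per extraction result to `out`."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for extracted in results:
        writer.writerow(to_row(extracted))


if __name__ == "__main__":
    # Hypothetical extraction result, already mapped to the column names above.
    sample = [{
        "company_name": "ACME GmbH",
        "total_amount": "119.00",
        "total_currency": "EUR",
        "vat_amount": "19.00",
        "vat_percent": "19",
        "vat_currency": "EUR",
        "paid_with": "Business account",
    }]
    buffer = io.StringIO()
    write_summary(sample, buffer)
    print(buffer.getvalue())
```

To write a real file instead of printing, pass a file opened with `open("invoice_summary.csv", "w", newline="")` as `out`.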

*I'm a co-founder of Veryfi, so do not hesitate to ask any questions

1

Extracting data from invoices is a complex problem. I haven't seen any open source solutions yet. OCR is just one part of the data extraction process. You also need image preprocessing, an AI engine for data recognition, etc.

There are many solutions to this problem, and every one of them is a bit different. @Petr Baudis already mentioned some of them.

They go from very simple:

  • OCR SPACE Receipt scanning - extracts data in a table format, but you still need to parse it and determine which part of the text is, e.g., the invoice number

To more advanced:

  • Nanonets - machine learning API with many solutions (invoices, tax forms, ...)
  • typless - single call API for any document (invoices, purchase orders, ...), free for 50 invoices per month
  • Parascript - templating system, similar to ABBYY FlexiCapture

It is important to know what your use case is. There is no one-size-fits-all solution. It depends on what you are trying to achieve:

  • Data mining - It must be cheap and fast. Missing or incorrect data is not mission-critical. You can clean it up during data analysis.

  • Automation in an enterprise - Recurring invoices that the system has been trained on must work almost 100% of the time. Speed and handling of new invoice formats are not mission-critical.

  • Automation in e.g., customs - It is essential that as much data as possible is returned. Accuracy on the whole set is vital, but probably every document will be reviewed anyway.

Therefore, you should test them and see how they fit into your process/needs.
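One way to run such a test is to label a handful of your own invoices by hand and measure how often each service gets each field right. The following is a minimal sketch, assuming you have already converted every service's output into plain dicts with the same field names as your hand-labelled data; the function name and the exact-match comparison are my own choices for illustration.

```python
from collections import defaultdict


def field_accuracy(predictions, ground_truth):
    """Return the share of exact matches per field.

    Both arguments are lists of dicts aligned by index: predictions[i] is
    what the service returned for the invoice labelled in ground_truth[i].
    A field missing from a prediction counts as wrong.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, expected in zip(predictions, ground_truth):
        for field, expected_value in expected.items():
            total[field] += 1
            if predicted.get(field) == expected_value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


if __name__ == "__main__":
    # Hypothetical hand-labelled values and service output for two invoices.
    truth = [
        {"total": "119.00", "vat": "19.00"},
        {"total": "50.00", "vat": "9.50"},
    ]
    predicted = [
        {"total": "119.00", "vat": "19.00"},
        {"total": "50.00", "vat": None},
    ]
    print(field_accuracy(predicted, truth))
```

Exact matching is strict on purpose. If your use case tolerates small deviations (for example, formatting differences in company names), you would normalise the values before comparing them.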

Disclaimer: I am one of the creators of typless. Feel free to suggest edits.

Jan Giacomelli
1

You can try out Nanonets; there is a sample in this GitHub repo:

https://github.com/NanoNets/invoice-processing-with-python-nanonets

import json

import requests

model_id = "Your Model Id"
api_key = "Your API Key"
image_path = "Invoice Path"

url = 'https://app.nanonets.com/api/v2/ObjectDetection/Model/' + model_id + '/LabelFile/'

# Open the invoice in a context manager so the file handle is closed after the upload
with open(image_path, 'rb') as image_file:
    data = {'file': image_file, 'modelId': ('', model_id)}
    response = requests.post(url, auth=requests.auth.HTTPBasicAuth(api_key, ''), files=data)

# Pretty-print the JSON response
print(json.dumps(response.json(), indent=2))
sj7