How to extract relevant information from receipt

Question

I am trying to extract information from a range of different receipts using a combination of OpenCV, Tesseract and Keras. The end result of the project is that I should be able to take a picture of a receipt using a phone and from that picture get the store name, payment type (card or cash), amount paid and change tendered.

So far I have applied a few preprocessing steps to a series of sample receipts using OpenCV, such as removing the background, denoising and converting to a binary image, and am left with an image such as the following:

I am then using Tesseract to perform OCR on the receipt and write the results out to a text file. I have managed to get the OCR to perform at an acceptable level, so I can currently take a picture of a receipt, run my program on it, and get a text file containing all the text on the receipt.

My problem is that I don't want all of the text on the receipt, I just want certain information such as the parameters I listed above. I am unsure as to how to go about training a model that will extract the data I need.

Am I correct in thinking that I should use Keras to segment and classify different sections of the image, and then write to file the text in sections that my model has classified as containing relevant data? Or is there a better solution for what I need to do?

Sorry if this is a stupid question; this is my first OpenCV/machine learning project and I'm pretty far out of my depth. Any constructive criticism would be much appreciated.

Why do you want to use a neural network to retrieve the information? You already have the text, so why can't you simply do some text filtering/processing in order to get the wanted words/terms/values? — petezurich, Aug 21 '17 at 13:23
I think I should be using a neural net here because the text can be completely different depending on which store the receipt is from, e.g. some stores might say "amount tendered" while another will just say "cash" to denote the amount that has been paid. As this can change from receipt to receipt with no set standard, I was under the impression that the problem couldn't be tackled using conventional filtering, as there are too many possible terms for any one particular value for me to hard-code. — R.E., Aug 21 '17 at 13:40
If you already retrieve the information then it's time to parse and analyze the text. Analyzing text is thousands of times easier than analyzing an image. Replacing your current approach with a neural net will only help you get better text. After that, you still have to analyze the text. You can get the text structure from Tesseract; use it. — Vu Gia Truong, Jun 21 '18 at 10:01
@R.E. Hi, I am working on a similar problem. May I know which solution you opted for and how did that work out for you? — Zaid Khan, Jan 21 '20 at 13:04

DanyAlejandro · Answer 1 · 2018-06-20T00:14:02.057

My answer isn't as fancy as what's in fashion right now, but I think it works in your case, especially if this is for a product (not for research & publication purposes).

I would implement the paper Text/Graphics Separation Revisited. I have already implemented it in both Matlab & C++ and I guarantee from your description it won't take you long. In summary:

1. Get all connected components with stats. You're especially interested in the bounding box for each character.
2. The paper obtains thresholds from histograms of the properties of your connected components, which makes it fairly robust. Using these thresholds (which work surprisingly well) on the geometrical properties of your connected components, discard anything that's not a character.
3. For your characters, get the centroid of each bounding box and group nearby centroids by your own criteria (height, vertical position, Euclidean distance, etc.). Use the resulting centroid clusters to create rectangular text regions.
4. Associate text regions of the same height and vertical position.
5. Run OCR on your text regions and look for keywords like "Cash". I honestly think you can get away with keeping keyword dictionaries in text files, and from having done computer vision for mobile I know your resources are limited (by privacy too).
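
The grouping steps above can be sketched in a few lines. This is only a minimal illustration, not the paper's method: it assumes you already have character bounding boxes as (x, y, width, height) tuples (for example from OpenCV's connectedComponentsWithStats, after discarding non-character components), and the function names and the tolerance value are hypothetical choices for the example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of one character


def centre_y(box: Box) -> float:
    """Vertical centre of a bounding box."""
    return box[1] + box[3] / 2


def group_boxes_into_lines(boxes: List[Box], tolerance: float = 0.5) -> List[List[Box]]:
    """Group character boxes with similar vertical position into text lines.

    A box joins the current line if its vertical centre is within
    `tolerance` * (mean box height of that line) of the line's mean centre.
    The tolerance is an assumed value; tune it on your own receipts.
    """
    lines: List[List[Box]] = []
    # Walk the boxes from top to bottom by vertical centre.
    for box in sorted(boxes, key=centre_y):
        if lines:
            last = lines[-1]
            mean_centre = sum(centre_y(b) for b in last) / len(last)
            mean_height = sum(b[3] for b in last) / len(last)
            if abs(centre_y(box) - mean_centre) <= tolerance * mean_height:
                last.append(box)
                continue
        lines.append([box])
    # Within each line, order the characters left to right.
    return [sorted(line, key=lambda b: b[0]) for line in lines]


def line_region(line: List[Box]) -> Box:
    """Rectangular text region (x, y, width, height) enclosing a line."""
    left = min(x for x, _, _, _ in line)
    top = min(y for _, y, _, _ in line)
    right = max(x + w for x, _, w, _ in line)
    bottom = max(y + h for _, y, _, h in line)
    return (left, top, right - left, bottom - top)


if __name__ == "__main__":
    sample = [(10, 10, 8, 12), (20, 11, 8, 12), (30, 9, 8, 12),
              (10, 40, 8, 12), (22, 41, 8, 12)]
    for text_line in group_boxes_into_lines(sample):
        print(line_region(text_line))
```

Each resulting region can then be cropped from the image and passed to OCR on its own, which keeps a label and its value on the same line together.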

I honestly don't think a neural net will be much better than some kind of keyword matching (e.g. using Levenshtein distance or something similar to add a bit of robustness) because you will need to manually create and label these words anyway to create your training dataset, so... Why not just write them down instead?
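
As a rough sketch of what such keyword matching might look like: the code below implements plain Levenshtein distance and uses it to map a noisy OCR token to the closest entry in a keyword list. The keyword list, the match_keyword helper and the max_distance threshold are all assumptions made for this example; in practice you would load your own dictionary from a text file.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_keyword(token: str, keywords: Iterable[str],
                  max_distance: int = 1) -> Optional[str]:
    """Return the closest keyword if it is within max_distance edits, else None."""
    token = token.lower().strip()
    best = min(keywords, key=lambda k: levenshtein(token, k), default=None)
    if best is not None and levenshtein(token, best) <= max_distance:
        return best
    return None


if __name__ == "__main__":
    # Hypothetical keyword dictionary; extend it with terms from real receipts.
    keywords = ["cash", "card", "total", "change"]
    for ocr_token in ["CASN", "T0TAL", "milk"]:
        print(ocr_token, "->", match_keyword(ocr_token, keywords))
```

A small threshold keeps short keywords such as "cash" and "card" from being confused with each other, since they are two edits apart.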

That's basically it. You end up with something very fast (especially if you want to use a phone and you can't send images to a server) and it just works. No machine learning needed, so no dataset needed either.

But if this is for school... Sorry I was so rude. Please use TensorFlow with 10,000 manually labeled receipt images and natural language processing methods, your professor will be happy.

score 0 · Answer 2 · answered Mar 06 '18 at 12:20

It's a good idea to use the image, as you will lose the structure of the document if you just use plain OCR. I think you are on the right track. I would segment the bill into headers, total amount and line items, and train an image classifier on those segments. Then you could use it to clean and extract the relevant information that you need from the text.

vumaasha

Do you have enough data to build an image classifier? – vumaasha Mar 06 '18 at 12:21
