Extract products and prices from invoice

Question

I want to extract information from PDFs.

The following is an extract from a policy, where the PDF was converted to a text document using https://github.com/yob/pdf-reader/.

Vehicle Description          2007, PORSCHE, CAYMAN 3.2

Registration Number          USD-2394                   Vin Number            FSDFKJL23123KFAS


MY COVER DETAILS

Cover                                                                                 USD37.45

I would like to extract, for example, the vehicle description, registration number and cost of cover:

vehicle.description => "2007, PORSCHE, CAYMAN 3.2"
vehicle.registration => "USD-2394"
vehicle.cost_of_cover => "37.45"

Can anyone please advise on an appropriate method? The problem is that the layout of the policy might change, but the data will mostly be the same, just with different values.

If regex is the way to go, could someone provide example code?

It depends on how much time or money you are willing to spend, and on how inconsistent your data is. A better solution might be to use a service like Mechanical Turk. As it stands, we can't really give a good answer to your question (have a look at SO's [question guidelines](http://stackoverflow.com/help/asking)). - Zach Kemp, Jun 19 '13 at 22:01
possible duplicate of [Ruby: Reading PDF files](http://stackoverflow.com/questions/773193/ruby-reading-pdf-files) - phoet, Jun 19 '13 at 22:03
Updated my question to hopefully give you a better idea of what I want. - the_dow, Jun 19 '13 at 22:51

score 1 · Accepted Answer · answered Jun 19 '13 at 23:15

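Finding the description

/Vehicle Description\s+(.*?)\s+Registration/m

Finding the registration number

/Registration Number\s+(.*?)\s+Vin/m

Finding the cost of cover

/Cover\s+USD([\d.]+)/

These are all fairly crude patterns: each one simply captures whatever sits between a label and the next label (or, for the cover, the digits after the USD prefix). You did not provide many different samples, but these should get you started.

Example usage, where PDFTEXT holds the converted text:

match = /Vehicle Description\s+(.*?)\s+Registration/m.match(PDFTEXT)
description = match[1] if match   # match is nil when the label is missing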

http://www.ruby-doc.org/core-2.0/Regexp.html
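
To make the usage concrete, here is a minimal end-to-end sketch run against the sample text from the question (column spacing shortened). It uses non-greedy variants of the patterns above. The helper name first_capture is made up for illustration, and the USD prefix is an assumption taken from the single sample, so adjust it if your policies use other currencies.

```ruby
# Sample text copied from the question (column spacing shortened).
PDFTEXT = <<~TEXT
  Vehicle Description          2007, PORSCHE, CAYMAN 3.2

  Registration Number          USD-2394          Vin Number          FSDFKJL23123KFAS


  MY COVER DETAILS

  Cover                                        USD37.45
TEXT

# Hypothetical helper: returns the first capture group with surrounding
# whitespace removed, or nil when the pattern does not match
# (Regexp#match returns nil in that case).
def first_capture(pattern, text)
  match = pattern.match(text)
  match && match[1].strip
end

description  = first_capture(/Vehicle Description\s+(.*?)\s+Registration/m, PDFTEXT)
registration = first_capture(/Registration Number\s+(.*?)\s+Vin/m, PDFTEXT)
cover        = first_capture(/Cover\s+USD([\d.]+)/, PDFTEXT)

puts description   # 2007, PORSCHE, CAYMAN 3.2
puts registration  # USD-2394
puts cover         # 37.45
```

Note that "MY COVER DETAILS" does not interfere with the last pattern, because Ruby regexes are case-sensitive unless you add the i flag.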

score 0 · Answer 2 · answered Jun 19 '13 at 23:31

You can do this fairly easily with regular expressions. Assuming the PDF text is stored in the variable text:

# String#[] with a capture index returns nil instead of raising when nothing matches
description  = text[/Vehicle Description\s+(.*?)\s+Registration/m, 1]
registration = text[/Registration Number\s+(.*?)\s+Vin/m, 1]
cover        = text[/Cover\s+USD([\d.]+)/, 1]
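
Since the question mentions that the layout may change, one possible way to make this less brittle is to collapse all whitespace first and keep the patterns in a single table, so column alignment and line breaks stop mattering and a missing label yields nil rather than an exception. This is only a sketch under assumptions: FIELD_PATTERNS, normalize and extract_fields are names invented here, the labels come from the one sample policy, and the currency prefix is assumed to be zero or more capital letters.

```ruby
# Hypothetical field table: each key maps to a pattern with one capture group.
# Patterns are written against whitespace-normalized text (single spaces only).
FIELD_PATTERNS = {
  description:   /Vehicle Description (.+?) Registration Number/,
  registration:  /Registration Number (\S+)/,
  cost_of_cover: /\bCover [A-Z]*(\d+(?:\.\d+)?)/
}.freeze

# Collapses every run of whitespace (spaces, tabs, newlines) into one space.
def normalize(text)
  text.gsub(/\s+/, " ").strip
end

# Returns a hash of field name => extracted string, with nil for any field
# whose pattern did not match.
def extract_fields(text)
  flat = normalize(text)
  FIELD_PATTERNS.transform_values { |pattern| flat[pattern, 1] }
end

sample = "Vehicle Description     2007, PORSCHE, CAYMAN 3.2\n\n" \
         "Registration Number     USD-2394     Vin Number     FSDFKJL23123KFAS\n\n" \
         "MY COVER DETAILS\n\nCover          USD37.45\n"

vehicle = extract_fields(sample)

# Fields that could not be found, useful for flagging policies to review by hand.
missing = vehicle.select { |_, value| value.nil? }.keys

puts vehicle[:description]    # 2007, PORSCHE, CAYMAN 3.2
puts vehicle[:registration]   # USD-2394
puts vehicle[:cost_of_cover]  # 37.45
```

If a label is renamed in a new layout, only the corresponding entry in the table needs to change, and the missing list tells you which documents no longer match.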
