Ideas to extract specific invoice pdf data for different formats and convert to Excel

Question

I am currently working on a digitalisation project which consists in extracting specific information from pdf-formatted electricity invoices. Once the data is extracted, I would like to store it in an Excel spreadsheet.

The objectives are the following:

First of all, the data to be extracted would be the following:

https://i.stack.imgur.com/6RLo2.png

In this case, the data to be extracted is the information surrounded in red. This would be the CUPS, the total amount and the consumed electricity per period (P1-P6).

Once this is extracted, I would like to display this in an Excel Spreadsheet.

Could you please give me any ideas/tips regarding the extraction of this data? I understand that OCR software would do this best, but do not know how I could extract this specific information.

Thanks for you help and advice.

Yes, OP may want to run a few tests whether there is such issue in his use case. I am doing this with pdf invoices generated from the system (not scanned) for quite a while and have not faced any accuracy issues so far. Main effort is related to the layout and position of the data on the document. My approach is to extract everything as lines of plain text and then running some regex-based “logics” to get the desired pieces of data. Regex makes it quite manageable to cover “special” cases as well. — Anonymous, Jun 17 '22 at 22:20
Hi, thanks for your reply. Could you please explain with some more detail how you extract your invoice data with Regex? Thanks — johannes orssich, Jun 20 '22 at 09:37
I updated my answer with more details. Basically, get the pdf content to plain text. Then use any tool that supports regex to extract. Regex makes it easier, but you still need to define the expressions/pattern you want to match with (e.g. you may want to look for the line starting with “Activa” and then any numbers followed by “kWh”). Further details depend on how your plain text looks like (and how many variations you might get due to slight differences in the layout of the pdf) — Anonymous, Jun 26 '22 at 09:08

Ignace Vau · Answer 1 · 2022-06-17T18:41:49.013

If there is no text data in your PDF then I don't believe there is a clean and consistent way to do this yet. If your invoice templates are always the same format and resolution, then the pixel coordinates of the text positions should be the same.

This means that you can create a cropped image with only the text you're interested in. Then you can use your OCR tool to extract all the text and you have extracted your data field. You would have to do this for all the data fields that you want to extract.

This would only work for invoices that always have the same format and resolution. So scanned invoices wouldn't work, and dynamic tables make things exponentially more complex as well.

Anonymous · Answer 2 · 2022-06-20T18:33:10.473

In Excel you may want to use PowerQuery to read the pdf:

https://learn.microsoft.com/en-us/power-query/connectors/pdf

Then you can further process to extract the data you want within PowerQuery.

If you are interested in further data analysis after extraction you may want to consider KNIME as well:

https://hub.knime.com/jyotendra/spaces/Public/latest/Reading%20PDF%20and%20extracting%20information~pNh3GdorF0Z9WGm8

From there export to Excel is also supported.

edit:

after extracting, regex helps to filter for the specific data, e.g. look for key words, length and structure of the data item (e.g. the CUPS number), is it a currency with decimal etc.

edit 2: regex in Excel

How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops

e.g. look for a new line starting with CUPS followed by a sequence of 15-characters (if you have more details, you can specify the matching pattern more: e.g. starting with E, or 5th character is X or 5, etc.)

K J · Answer 3 · 2022-06-18T00:56:35.153

I would check if its possible to simply extract the text using PDF to text 1st then work my cmd text parsing around that output, and loop file to file.

I don't have your sample to test so you would need to adjust to suit your bills

pdftotext -nopgbrk -layout electric.pdf - |findstr /i "cups factura" & pdftotext -nopgbrk -layout -y 200 -W 300 -H 200 electric.pdf

Personally would use the two parts as separate cycles so first pair replace the , with a safe csv character such as * then inject , for the large gap to make them 2 column csv (perhaps replace the Γé¼ with ,€ if necessary since your captured text may be in €uros already)

The second group I would possibly inject , by numeric position to form the desired columns, I only demo 4 column by 2 rows but you want 7 column by 4 rows, so adjust those values to suit. However, you can use any language you are familiar with such as VBA to split how you want to import in to eXcel.

Thanks for your help, I will give it a try with the suggested commands — johannes orssich, Jun 20 '22 at 09:45

Ideas to extract specific invoice pdf data for different formats and convert to Excel

3 Answers3