Information extraction from form documents

Question

I'm currently working on a Python project where I need to extract some information from PDF documents. The list of information to extract is the same for all the documents. The PDFs are structured documents in various languages and can be treated as form documents.

I'm wondering if there is any machine-learning model, or other way, to solve this task. :)

Samples of different PDF documents: Sample 1, Sample 2

I would like to extract both the currency and the initial issue date, so the output would be (EUR,30 Janvier 2013) for the first sample and (EUR,29 January 2009) for the second.

Maxime

Thanks, but I know how to extract text from a PDF. My problem here is that I want to automate the processing of my, let's say, invoices, to recover the different pieces of key information :) – Maxime Prieur, Mar 05 '20 at 22:03

anonymous developer · Answer 1 · 2020-03-10T20:56:54.300

Are you trying to extract information from a specific row or column, and is the PDF always formatted the same way for at least most rows and columns? If so, you may not need an ML model and can just use awk or sed.

UPDATE with answer:

First use pdftotext or something similar to convert the PDF into a text file. Once you get it into a format like this (blank lines are optional) in a file called "yourfile.txt":

Note Currency   EUR
Trade Date  16 January 2009
Initial Issue Date 29 January 2009

Note Currency   EUR
Trade Date  16 January 2009
Initial Issue Date 29 January 2009

Note Currency   EUR
Trade Date  16 January 2009
Initial Issue Date 29 January 2009

you can use

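# Currency lines: print "(" followed by the third field, e.g. "(EUR"
awk '$1 == "Note" {print "(" $3}' yourfile.txt > out1
# Issue-date lines: print "," followed by day, month and year, then ")"
awk '$2 == "Issue" {print "," $4 " " $5 " " $6 ")"}' yourfile.txt > out2
# Join the two files line by line with no delimiter: "(EUR,29 January 2009)"
paste -d '\0' out1 out2 > formatted.txt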

Your formatted results will now be in a file called formatted.txt.
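Since the project is in Python, the same idea can be sketched without awk. This is only a minimal sketch, not a tested solution for the real PDFs: the function name extract_pairs is made up for illustration, the two labels are copied from the sample text above, and it assumes the currency line always comes before the issue-date line of the same record.

```python
import re

# Labels copied from the sample text; other documents may word them differently.
CURRENCY_RE = re.compile(r"^\s*Note Currency\s+(\S+)")
ISSUE_DATE_RE = re.compile(r"^\s*Initial Issue Date\s+(.+?)\s*$")


def extract_pairs(text):
    """Return one (currency, issue_date) tuple per record found in text."""
    pairs = []
    currency = None
    for line in text.splitlines():
        match = CURRENCY_RE.match(line)
        if match:
            # Remember the currency until the matching issue date shows up.
            currency = match.group(1)
            continue
        match = ISSUE_DATE_RE.match(line)
        if match and currency is not None:
            pairs.append((currency, match.group(1)))
            currency = None
    return pairs


sample = """Note Currency   EUR
Trade Date  16 January 2009
Initial Issue Date 29 January 2009
"""
print(extract_pairs(sample))  # [('EUR', '29 January 2009')]
```

Unlike the awk version, this does not depend on how many whitespace-separated fields the label has, only on the label text itself.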

edited Mar 10 '20 at 20:56

answered Mar 05 '20 at 18:38

anonymous developer

The line on which the information appears can change between documents. However, there is generally a column separation between the key and the value! – Maxime Prieur Mar 05 '20 at 21:57
Paste a few example lines here and tell me what you want extracted and in which format you want it. – anonymous developer Mar 06 '20 at 00:26
I added 2 examples to my question :) – Maxime Prieur Mar 08 '20 at 17:09
Unfortunately, that does not solve the automation problem... :/ What I intend to do is build an NER model with spaCy to detect the labels "Key" and "Value"; the key will then be looked up in a dict to obtain the "formal" label, and the value that follows it will be output. :) – Maxime Prieur Mar 14 '20 at 15:57
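The dict lookup described in the last comment could look roughly like the sketch below. It is only an illustration: it leaves out the spaCy NER step entirely and matches keys with a plain regular expression instead, the function name extract_fields and the formal labels are made up, and the French aliases are guesses rather than wording taken from the real PDFs.

```python
import re

# Hypothetical alias table: each key spelling maps to one formal label.
# The French entries are illustrative guesses, not taken from the real PDFs.
KEY_ALIASES = {
    "note currency": "currency",
    "devise": "currency",
    "initial issue date": "initial_issue_date",
    "date d'émission initiale": "initial_issue_date",
}


def extract_fields(text, aliases=KEY_ALIASES):
    """Map each formal label to the value that follows a known key on its line."""
    # Longer aliases go first so a short alias cannot shadow a longer one.
    ordered = sorted(aliases, key=len, reverse=True)
    key_re = re.compile(
        r"^\s*(" + "|".join(re.escape(a) for a in ordered) + r")\b\s*:?\s*(.*\S)",
        re.IGNORECASE,
    )
    found = {}
    for line in text.splitlines():
        match = key_re.match(line)
        if not match:
            continue
        label = aliases.get(match.group(1).lower())
        # Keep the first value seen for each formal label.
        if label is not None and label not in found:
            found[label] = match.group(2)
    return found


sample_fr = "Date d'émission initiale   30 Janvier 2013\nDevise : EUR\n"
print(extract_fields(sample_fr))
# {'initial_issue_date': '30 Janvier 2013', 'currency': 'EUR'}
```

Because every line is checked against the alias table, the result does not depend on where the key appears in the document, and supporting a new language only means adding entries to the dict.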
