Optical Character Recognition on PDFs (python)

Question

I'm using ocrmypdf to run OCR on campaign finance PDFs. Example PDFs: https://apps1.lavote.net/camp/comm.cfm?&cid=11

My client wants to parse these PDFs, as well as others (Form 496s, Form 497s). The problem is that even for forms of the same type, the OCR results are inconsistent.

For example, one pdf (form 460) will yield these results:

Statement covers period

from 07/01/2005

through __11/30/2005

and another of the same type yields:

Statement covers period

01/01/2006

from

through 03/17/2006

Notice that in the first result the first date comes after "from", whereas in the second it comes before "from". This complicates parsing the data.

I'm using what I call "checkpoints" to parse forms of similar type. Here's an example:

checkpoints = [
        ['Statement covers period from', 'Date From'],
        ['through', 'Date Thru'],
        ['Date of election if applicable:', None],
        ['\n', None],
        ['\\NUMBER Treasurer(s)\n', 'ID'],
        ['\n', None],
        ['COMMITTEE NAME (OR CANDIDATE’S NAME IF NO COMMITTEE)\n\n', 'Committee / Candidate Name'],
        ['\n', None],
        ['NAME OF TREASURER\n\n', 'Name of Treasurer'],
        ['\n', None],
        ['NAME OF OFFICEHOLDER OR CANDIDATE\n\n', 'Name of Officeholder or Candidate'],
        ['\n', None],
        ['OFFICE SOUGHT OR HELD (INCLUDE LOCATION AND DISTRICT NUMBER IF APPLICABLE)\n\n', 'Office Sough or Held'],
        ['\n', None],
    ]

I loop through the checkpoints. For each one, I find the start index from the current checkpoint's text and the end index from the next checkpoint's text (element [0] of each pair, not [1]), and I save the contents in between to a key in a master object, like county_object[checkpoint[1]] = contents[start_index:end_index].

This setup only works for the specific PDF I am parsing. Because ocrmypdf yields different results even for forms of the same type, my setup is not ideal. Can someone point me in the right direction on how I should go about this?

Thanks

This is a hard problem to solve; I'd even say virtually impossible without NLP. There are plenty of companies built on solving basically this exact problem (extracting the data you want even with slight variations in the document and/or OCR results). Source: worked in R&D for one. - Eric Le Fort, Sep 07 '20 at 17:56

score 1 · Accepted Answer · answered Sep 07 '20 at 18:17

I imagine the difference between "identical" Form 460s is a vertical misalignment due to one being scanned at a slight clockwise angle and another at a slight counterclockwise angle. I hope you are invoking ocrmypdf with --deskew, but even with that there may be minor aberrations that prove troublesome.
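
For reference, here is a minimal sketch of an invocation with deskewing turned on, assuming ocrmypdf is installed and on the PATH. The helper names build_ocr_command and run_ocr and the file names are hypothetical. --deskew and --sidecar are ocrmypdf command-line options; the latter writes the recognized text to a separate file.

```python
import subprocess


def build_ocr_command(src_pdf, dst_pdf, text_path):
    """Build an ocrmypdf command line with deskewing enabled (hypothetical helper)."""
    return [
        "ocrmypdf",
        "--deskew",               # straighten pages scanned at a slight angle
        "--sidecar", text_path,   # also write the recognized text to this file
        src_pdf,
        dst_pdf,
    ]


def run_ocr(src_pdf, dst_pdf, text_path):
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(src_pdf, dst_pdf, text_path), check=True)
```

Calling run_ocr("form460.pdf", "form460_ocr.pdf", "form460.txt") would then leave the text to parse in form460.txt.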

The vertical separation between the dates seems large and robust, so the "from" date should reliably precede the "through" date in the OCR output, wherever the labels end up. Consider focusing more on the mm/dd/yyyy pattern and less on the text anchors.
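
A minimal sketch of that idea, assuming the first two valid dates after "Statement covers period" are the start and end of the period, in that order. The function name extract_period is hypothetical; the keys match the labels used in the question's checkpoints. The pattern uses digit lookarounds rather than \b, because a stray underscore as in "__11/30/2005" counts as a word character and would defeat a word-boundary match.

```python
import re
from datetime import date

# mm/dd/yyyy, not glued to other digits; tolerates junk such as "__" before it.
DATE_RE = re.compile(r"(?<!\d)(\d{1,2})/(\d{1,2})/(\d{4})(?!\d)")
ANCHOR = "Statement covers period"


def extract_period(text):
    """Return the first two valid dates after ANCHOR, in text order, or None."""
    start = text.find(ANCHOR)
    if start == -1:
        return None
    dates = []
    for match in DATE_RE.finditer(text, start):
        month, day, year = map(int, match.groups())
        try:
            dates.append(date(year, month, day))
        except ValueError:
            continue  # looked like a date but is not one, e.g. 13/45/2005
        if len(dates) == 2:
            return {"Date From": dates[0], "Date Thru": dates[1]}
    return None  # fewer than two dates found
```

Both OCR variants shown in the question produce the same dictionary shape with this approach, because the position of the words "from" and "through" is never consulted.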

You can obtain bounding-box coordinates from Tesseract OCR. Use them to disambiguate dates, based on your knowledge of what appears higher or lower on the form, and by (approximately) how much.
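
A rough sketch of the bounding-box approach, assuming the pytesseract wrapper is used to call Tesseract and that the "from" date sits above the "through" date on the page. pytesseract.image_to_data with Output.DICT returns parallel lists under keys such as "text", "left" and "top". The function names ocr_words and dates_top_to_bottom are hypothetical.

```python
import re

DATE_RE = re.compile(r"(?<!\d)\d{1,2}/\d{1,2}/\d{4}(?!\d)")


def ocr_words(image):
    """Run Tesseract on a PIL image and return its word-level data as a dict."""
    import pytesseract  # imported lazily so the parsing below works without it
    return pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)


def dates_top_to_bottom(data):
    """Return the date strings found in the OCR words, ordered by position.

    Words are sorted by their "top" coordinate (then "left"), so the result
    does not depend on the order in which Tesseract emitted them.
    """
    found = []
    for text, left, top in zip(data["text"], data["left"], data["top"]):
        match = DATE_RE.search(text)
        if match:
            found.append((top, left, match.group()))
    found.sort()
    return [value for _top, _left, value in found]
```

Under the stated assumption, the first element would be the "from" date and the second the "through" date. Comparing the "top" values against the expected spacing on the form would be a further check on top of this.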
