I'm using ocrmypdf to run OCR on campaign finance PDFs. Example PDFs: https://apps1.lavote.net/camp/comm.cfm?&cid=11
My client wants to parse these PDFs, as well as others (Form 496s, Form 497s). The problem is that even with forms of the same type, the OCR results are inconsistent.
For example, one PDF (Form 460) yields these results:
Statement covers period
from 07/01/2005
through __11/30/2005
and another of the same type yields:
Statement covers period
01/01/2006
from
through 03/17/2006
Notice that in the first, the first date comes after the "from", whereas in the second, the first date comes before the "from". This creates complications when trying to parse the data.
I'm using what I call "checkpoints" to parse forms of similar type. Here's an example:
checkpoints = [
['Statement covers period from', 'Date From'],
['through', 'Date Thru'],
['Date of election if applicable:', None],
['\n', None],
['\\NUMBER Treasurer(s)\n', 'ID'],
['\n', None],
['COMMITTEE NAME (OR CANDIDATE’S NAME IF NO COMMITTEE)\n\n', 'Committee / Candidate Name'],
['\n', None],
['NAME OF TREASURER\n\n', 'Name of Treasurer'],
['\n', None],
['NAME OF OFFICEHOLDER OR CANDIDATE\n\n', 'Name of Officeholder or Candidate'],
['\n', None],
['OFFICE SOUGHT OR HELD (INCLUDE LOCATION AND DISTRICT NUMBER IF APPLICABLE)\n\n', 'Office Sought or Held'],
['\n', None],
]
I loop through the checkpoints. For each one, I find the start index from the current checkpoint's text (element [0], not [1]) and the end index from the next checkpoint's text, then save the contents to a key in a master object, like county_object[checkpoint[1]] = contents[start_index:end_index].
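To make the failure concrete, here is a minimal sketch of the loop described above. It is not my exact code: the function name extract_fields, the shortened checkpoint list, and the two sample strings (copied from the OCR output earlier in this post) are for illustration only.

```python
def extract_fields(contents, checkpoints):
    """Slice the text between each checkpoint label and the next one."""
    fields = {}
    search_from = 0
    for i, (label, key) in enumerate(checkpoints):
        start = contents.find(label, search_from)
        if start == -1:
            continue  # label missing from this OCR output, skip it
        start_index = start + len(label)
        # The value ends where the next checkpoint's label begins.
        end_index = len(contents)
        if i + 1 < len(checkpoints):
            next_pos = contents.find(checkpoints[i + 1][0], start_index)
            if next_pos != -1:
                end_index = next_pos
        if key is not None:
            fields[key] = contents[start_index:end_index].strip()
        search_from = start_index
    return fields


# Shortened, hypothetical checkpoint list (illustration only).
simple_checkpoints = [
    ['from', 'Date From'],
    ['through', 'Date Thru'],
    ['\n', None],
]

# The two OCR outputs shown earlier in this post.
first_sample = "Statement covers period\nfrom 07/01/2005\nthrough __11/30/2005\n"
second_sample = "Statement covers period\n01/01/2006\nfrom\nthrough 03/17/2006\n"

print(extract_fields(first_sample, simple_checkpoints))
print(extract_fields(second_sample, simple_checkpoints))
```

With the first sample, both dates are captured (the second one still carries the stray underscores from the OCR). With the second sample, 'Date From' comes back empty, because the date sits before "from" rather than between "from" and "through".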
This setup only works for the specific PDF I am parsing. Because ocrmypdf yields different results even for forms of the same type, my setup is not ideal. Can someone point me in the right direction on how I should go about this?
Thanks