How to OCR scanned voting protocols

Question

As part of a hobby project I'm trying to digitise all the voting records of the Swedish parliament to see if I can extract any interesting statistics (yes, a strange hobby, I know).

From 1983 to 2001 the voting records look something like the example. They were printed by some kind of voting machine and exist only on paper (now scanned and stored on my disk).

Every vote consists of three pages, each with two columns of members and votes, as in the example. Some translations from Swedish: Plats = Seat, Ledamöter = Members, Parti = Party, Röst = Vote.

The list is sorted by the party column and then alphabetically by the member column. After an election a member keeps their seat until the next election, or until they leave parliament and are replaced by a substitute member. There can also be temporary substitutes. Substitutes are sorted into the list as well.

The Parti/Party column contains the party abbreviation and can only be one of nine letters (since there were only nine different parties in that time period). Members normally stay with their party but can technically change party between or during election periods.

The Röst/Vote column can be one of J, N, A, F (Ja = Yea, Nej = Nay, Avstår = Abstain, Frånvarande = Absent), and the letters are also aligned in four sub-columns.
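Since the vote letters sit in sub-columns, the horizontal position of a character could serve as a second opinion on what the OCR reads. The following is only a minimal sketch of that idea. It assumes that each letter has its own sub-column, that the order from left to right is J, N, A, F, and that the left edge and width of the sub-columns (`column_left`, `sub_width`) have already been measured; none of this is confirmed by the scans, and all function names are hypothetical.

```python
from typing import Optional

# Assumed left-to-right order of the four sub-columns (not verified on the scans).
VOTES = "JNAF"


def vote_from_position(x: float, column_left: float, sub_width: float) -> Optional[str]:
    """Map the horizontal position of a vote character to a vote letter."""
    index = int((x - column_left) // sub_width)
    if 0 <= index < len(VOTES):
        return VOTES[index]
    return None  # the character lies outside the vote column


def resolve_vote(ocr_char: str, x: float, column_left: float, sub_width: float) -> Optional[str]:
    """Combine the OCR reading with the position-based guess.

    Returns the vote letter when the two sources agree or when only one of
    them yields a valid letter. Returns None when they disagree or when
    neither is usable, so that the row can be flagged for manual review.
    """
    by_position = vote_from_position(x, column_left, sub_width)
    upper = ocr_char.upper()
    by_ocr = upper if len(upper) == 1 and upper in VOTES else None

    if by_position is None:
        return by_ocr
    if by_ocr is None:
        return by_position
    return by_ocr if by_ocr == by_position else None
```

With a column starting at x = 100 and sub-columns 20 pixels wide, `resolve_vote("?", 150, 100, 20)` falls back to the position and gives `"A"`, while `resolve_vote("N", 105, 100, 20)` gives `None` because the letter and the position disagree.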

The columns and rows are not always in the same place in the picture. The page can be slightly translated, rotated, or both. The quality of the scan is also not always this good.
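A small rotation could in principle be estimated from the data itself, without any calibration marks. As a rough sketch, assume that some OCR step returns the (x, y) position of the first character of every row in one column; on a straight scan these points would share the same x. A least-squares line through them then gives the skew angle. The function names are hypothetical, and the sketch only handles rotation about the origin, not the translation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def estimate_skew(points: List[Point]) -> float:
    """Estimate the skew angle (radians) of points that should be vertically aligned.

    Fits x = a*y + b by least squares and returns atan(a).
    Needs at least two points with different y values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.atan2(cov_xy, var_y)


def deskew(points: List[Point], angle: float) -> List[Point]:
    """Rotate points about the origin so a column skewed by `angle` becomes vertical."""
    c, s = math.cos(angle), math.sin(angle)
    return [(x * c - y * s, x * s + y * c) for x, y in points]
```

After deskewing, the x coordinates of one column should be nearly constant, which makes it much simpler to assign each character box to the seat, member, party or vote column.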

At the top of the first page (not shown in the example) there is a summary of votes per party that the result can be checked against.
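That summary could act as a checksum for each vote. A minimal sketch, assuming the OCR step yields one `(party, vote)` pair per member and the summary has been read into a dictionary of counts per party; the data layout and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

VOTES = "JNAF"


def tally_by_party(rows: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count the votes per party from (party, vote) pairs."""
    tally: Dict[str, Counter] = {}
    for party, vote in rows:
        tally.setdefault(party, Counter())[vote] += 1
    return tally


def find_mismatches(rows, expected: Dict[str, Dict[str, int]]):
    """Return the parties whose counted votes differ from the printed summary.

    The result maps each such party to a (counted, expected) pair of dicts.
    An empty result means the page is consistent with its summary.
    """
    actual = tally_by_party(rows)
    mismatches = {}
    for party in set(actual) | set(expected):
        counted = {v: actual.get(party, Counter())[v] for v in VOTES}
        wanted = {v: expected.get(party, {}).get(v, 0) for v in VOTES}
        if counted != wanted:
            mismatches[party] = (counted, wanted)
    return mismatches
```

A vote whose tally matches the summary can be accepted automatically, so only the mismatching ones need a manual look. Matching totals do not prove that every row is right, since two errors within a party can cancel out.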

There are about 18,000 votes in total.

For votes before 1983 the layout was different, and I was able to write a custom program in Node.js (although most languages would be fine for me) that could scan the votes semi-automatically. The newer layout, however, looks like something that should be easy to handle with Tesseract or something similar.

My real question is whether it's possible to hint Tesseract about the layout so that it can make better guesses about what the text is. I'm aware that one can supply a custom word list (to which I could, for example, add all the members' names manually).
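Independently of what Tesseract itself supports, a list of member names could also be applied after the OCR step by snapping each recognised name to the closest known one. This is a post-processing sketch using Python's standard `difflib`, not a way of hinting Tesseract; the names in the usage example and the `cutoff` value are made up for illustration.

```python
import difflib
from typing import List, Optional


def match_member(ocr_name: str, known_members: List[str], cutoff: float = 0.75) -> Optional[str]:
    """Return the known member name closest to the OCR output.

    Returns None when nothing is at least `cutoff` similar, so that the
    row can be flagged for manual review instead of being guessed.
    """
    candidates = difflib.get_close_matches(ocr_name, known_members, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


# Hypothetical member list; the real one would hold the members of the
# relevant election period.
known = ["Andersson, Anna", "Johansson, Karl", "Svensson, Erik"]
print(match_member("Anderss0n, Anna", known))
print(match_member("xxxx", known))
```

Because the list is sorted by party and then by name, the candidate list could be narrowed further to the members of the party on the current row, which should reduce false matches between similar surnames.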

I'm guessing there might also be a way to supply a custom pattern list, but I haven't figured that out.

Does anyone have any good suggestions on how to tackle this?

Interesting problem. I'm wondering if you could use the hole punches to calibrate your images? If the holes are always there, you could detect them with Hough circles in OpenCV. AFAIK, Tesseract is not that good at page layout. The Python API has an image_to_boxes function which gives you the location of every character. This, combined with calibration, might allow you to figure out which characters are names and which characters are votes. - bfris, Nov 18 '20 at 21:00
I checked a few samples and the holes are unfortunately not always visible. I briefly looked at Apple's VisionKit for macOS (and iOS) and it seems like that might be an avenue of research. It can also return boxes for both characters and words, so maybe it's possible to somehow relate the boxes to the expected data. - potmo, Nov 18 '20 at 22:03
