I'm parsing some PDFs in Python. These PDFs are visually organized into rows and columns. The pdftohtml script converts these PDFs to an XML format, full of loose <text> tags which don't have any hierarchy. My code then needs to sort these <text> tags back into rows.
Since each <text> tag has attributes like "top" or "left" coordinates, I wrote code to append <text> items with the same "top" coordinate to a list. This list is effectively one row.
My code first iterates over the page, finds all unique "top" values, and appends them to a tops list. Then it iterates over this tops list. For each unique top value, it searches for all items that have that "top" value and adds them to a row list.
    for side in page:
        tops = list( set( [ d['top'] for d in side ] ) )
        tops.sort()
        for top in tops:
            row = []
            for blob in side:
                if int(blob['top']) == int(top):
                    row.append(blob)
            rows.append(row)
This code works great for the majority of the PDFs I'm parsing. But in some cases, items on the same row have slightly different "top" values, off by one or two.
I'm trying to adapt my code to become a bit fuzzier.
The comparison at the bottom seems easy enough to fix. Something like this:
    for blob in side:
        rangeLower = int(top) - 2
        rangeUpper = int(top) + 2
        thisTop = int(blob['top'])
        if rangeLower <= thisTop <= rangeUpper :
            row.append(blob)
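The same range check could probably be written more compactly with abs(). This is only a sketch: `is_same_row` and `tolerance` are names I made up for illustration, and it assumes the "top" values always convert cleanly with int().

```python
def is_same_row(blob_top, row_top, tolerance=2):
    """Return True if two "top" coordinates are within `tolerance` of each other.

    Hypothetical helper: the name and the default tolerance of 2 are
    assumptions, chosen to match the +/- 2 range check above.
    """
    return abs(int(blob_top) - int(row_top)) <= tolerance


# Values may arrive as strings from the XML, so both forms are accepted.
print(is_same_row('996', 995))
print(is_same_row(998, 995))
```

The first call prints True (difference of 1) and the second prints False (difference of 3).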
But the list of unique top values that I create first is a problem. The code I use is
    tops = list( set( [ d['top'] for d in side ] ) )
In these edge cases, I end up with a list like:
    [925, 946, 966, 995, 996, 1015, 1035]
How could I adapt that code to avoid having "995" and "996" in the list? I want to ensure I end up with just one value when integers are within 1 or 2 of each other.
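To make the goal concrete, here is a rough sketch of the behavior I'm after. It is not a settled solution: `collapse_tops` and `tolerance` are names I invented, and it assumes that keeping the smallest value of each cluster as the representative is acceptable.

```python
def collapse_tops(tops, tolerance=2):
    """Collapse values that lie within `tolerance` of the last kept value.

    Hypothetical helper: sorts the values as integers and keeps a value
    only if it is more than `tolerance` above the previously kept one.
    """
    collapsed = []
    for top in sorted(int(t) for t in tops):
        # Keep the first value, and any value far enough from the last kept one.
        if not collapsed or top - collapsed[-1] > tolerance:
            collapsed.append(top)
    return collapsed


print(collapse_tops([925, 946, 966, 995, 996, 1015, 1035]))
# → [925, 946, 966, 995, 1015, 1035]
```

Note that this compares each value with the last kept value, not with its immediate neighbor, so a long chain such as 995, 997, 999 would be split into 995 and 999 instead of being merged into one. I'm not sure which of those two behaviors is the better fit for my data.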