
I'm parsing some PDFs in Python. These PDFs are visually organized into rows and columns. The pdftohtml script converts these PDFs to an XML format, full of loose <text> tags which don't have any hierarchy. My code then needs to sort these <text> tags back into rows.

Since each <text> tag has attributes like "top" or "left" coordinates, I wrote code to append <text> items with the same "top" coordinate to a list. This list is effectively one row.

My code first iterates over the page, finds all unique "top" values, and appends them to a tops list. Then it iterates over this tops list. For each unique top value, it searches for all items that have that "top" value and adds them to a row list.

for side in page:
    tops = list( set( [ d['top'] for d in side ] ) )
    tops.sort()
    for top in tops:
        row = []
        for blob in side:
            if int(blob['top']) == int(top):
                row.append(blob)
        rows.append(row)

This code works great for the majority of the PDFs I'm parsing. But there are cases where items which are on the same row have slightly different top values, off by one or two.

I'm trying to adapt my code to become a bit fuzzier.

The comparison at the bottom seems easy enough to fix. Something like this:

        for blob in side:
            rangeLower = int(top) - 2
            rangeUpper = int(top) + 2
            thisTop = int(blob['top'])
            if rangeLower <= thisTop <= rangeUpper :
                row.append(blob)

But the list of unique top values that I create first is a problem. The code I use is

    tops = list( set( [ d['top'] for d in side ] ) )

In these edge cases, I end up with a list like:

[925, 946, 966, 995, 996, 1015, 1035]

How could I adapt that code to avoid having "995" and "996" in the list? I want to ensure I end up with just one value when integers are within 1 or 2 of each other.

Kirkman14
  • If you had `1,2,3,4,5` in your list, which ones would you choose? 1 and 4? 1 and 5? 2 and 5? 3? – njzk2 Apr 17 '14 at 18:20
  • In the PDFs I'm parsing, rows are consistently spaced at least 20 units apart, so I don't think I would end up with such a list. – Kirkman14 Apr 17 '14 at 18:30

2 Answers

4
  • Sort the list to put the close values next to one another
  • Use reduce to filter the value depending on the previous value

Code:

>>> from functools import reduce  # required on Python 3; reduce is a builtin on Python 2
>>> tops = [925, 946, 966, 995, 996, 1015, 1035]
>>> threshold = 2
>>> reduce(lambda x, y: x + [y] if len(x) == 0 or y > x[-1] + threshold else x, sorted(tops), [])
[925, 946, 966, 995, 1015, 1035]

With several contiguous values:

>>> tops = range(10)
>>> reduce(lambda x, y: x + [y] if len(x) == 0 or y > x[-1] + threshold else x, sorted(tops), [])
[0, 3, 6, 9]

Edit

Reduce can be a little cumbersome to read, so here is a more straightforward approach:

res = []
for item in sorted(tops):
    if len(res) == 0 or item > res[-1] + threshold:
        res.append(item)
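
To connect this back to the question, the same comparison can group the <text> items into rows directly, without building the list of tops first. This is a minimal sketch, assuming each item is a dict whose 'top' value converts to int; `group_rows` and the sample `side` data are hypothetical names made up for illustration.

```python
def group_rows(side, threshold=2):
    """Group items whose 'top' is within `threshold` of the row's first item."""
    rows = []
    row_top = None  # 'top' of the first item in the current row
    for blob in sorted(side, key=lambda b: int(b['top'])):
        top = int(blob['top'])
        if row_top is None or top > row_top + threshold:
            rows.append([])  # start a new row
            row_top = top
        rows[-1].append(blob)
    return rows


# Hypothetical sample data, not taken from a real PDF.
side = [
    {'top': '995', 'left': '10', 'text': 'a'},
    {'top': '925', 'left': '10', 'text': 'b'},
    {'top': '996', 'left': '80', 'text': 'c'},
]
rows = group_rows(side)
print([[b['text'] for b in row] for row in rows])  # [['b'], ['a', 'c']]
```

Items inside a row keep the order they had in the input among equal tops, so a further sort on 'left' may be wanted if column order matters.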
njzk2
  • Can you walk me through the reduce() line in your first code sample? That looks like what I need, I just want to understand what's going on. – Kirkman14 Apr 17 '14 at 18:32
  • reduce works like this: it takes an initial value (in this case the last parameter, `[]`, an empty list), then calls the lambda with x being that value and y being the first item of the list. It then repeats the call with x being the result of the previous call and y the next item in the list. It returns the final result. For example, `reduce(lambda x,y: x+[y], tops, [])` makes a copy of `tops`. – njzk2 Apr 17 '14 at 18:34
  • The content of the test in the lambda only appends `y` if it is greater than the previous element + threshold. This condition is sufficient because the list is monotonically growing. (`len(x) == 0` is added to the test to account for the first iteration.) – njzk2 Apr 17 '14 at 18:36
  • The `a if condition else b` form is Python's conditional (ternary) expression. It evaluates to a if the condition is true, or to b if the condition is false. – njzk2 Apr 17 '14 at 18:37
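
The explanation in these comments can be made concrete by replacing the lambda with a named function and recording each call. This is only a sketch: `keep_if_far` and `steps` are names invented here, the three-value input is a shortened example, and `functools.reduce` is used so that it also runs on Python 3.

```python
from functools import reduce

threshold = 2
steps = []  # (accumulator before, incoming value, accumulator after)


def keep_if_far(acc, value):
    # Same test as the lambda: keep `value` only if nothing has been kept yet
    # or `value` is more than `threshold` above the last kept value.
    if len(acc) == 0 or value > acc[-1] + threshold:
        result = acc + [value]
    else:
        result = acc
    steps.append((acc, value, result))
    return result


tops = [995, 996, 1015]
final = reduce(keep_if_far, sorted(tops), [])

for before, value, after in steps:
    print(before, value, '->', after)
# [] 995 -> [995]
# [995] 996 -> [995]
# [995] 1015 -> [995, 1015]
```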
0

@njzk2's answer works too, but this function actually shows what is going on and is easier to understand:

>>> def sort(list):
...     list.sort() #sorts in ascending order
...     x = range(0, len(list), 1) #gets range
...     x.reverse() #reverses
...     for k in x:
...             if list[k]-1 == list[k-1]: #if the list value -1 is equal to the next,
...                     del(list[k-1])     #remove it
...     return list #return
... 
>>> tops = [925, 946, 966, 995, 996, 1015, 1035]
>>> sort(tops)
[925, 946, 966, 996, 1015, 1035]
>>> 
njzk2
A.J. Uppal
  • `del(list[k-1])` is highly inefficient. You don't need to `reverse`, just use `range(len(tops) -1, -1, -1)`. If you are going to reverse, use `reversed`, which returns an iterator, rather than performing the complete in-place reversal. The question mentions `within 1 or 2 of each other`, so checking only for a difference of 1 is insufficient. – njzk2 Apr 17 '14 at 18:44
  • Also, you are iterating over one item too many, as k goes as low as 0 (which will return an empty list if you test for equal items and all items are equal). – njzk2 Apr 17 '14 at 18:46
  • Also, you really should never name a list `list`. – njzk2 Apr 17 '14 at 18:46
  • Also, this modifies the input list (the sort is in place). That may not be a wanted effect (although the function name does say `sort`). – njzk2 Apr 17 '14 at 18:51
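
One way to address the points raised in these comments is to build a new list instead of deleting in place, and to take the tolerance as a parameter. The sketch below is an assumption about what such a rewrite might look like, not the answerer's code; `collapse_close` is a hypothetical name. Unlike the first answer, it compares each value with its immediate predecessor, so a whole chain of close values collapses into a single entry, and it keeps the first value of each run (995 here) rather than the last.

```python
def collapse_close(values, threshold=2):
    """Return a new sorted list keeping the first value of each run of
    neighbours that are at most `threshold` apart."""
    result = []
    previous = None
    for value in sorted(values):  # sorted() copies, so `values` is not modified
        if previous is None or value - previous > threshold:
            result.append(value)
        previous = value
    return result


tops = [925, 946, 966, 995, 996, 1015, 1035]
print(collapse_close(tops))       # [925, 946, 966, 995, 1015, 1035]
print(collapse_close(range(10)))  # [0]
```

With rows spaced at least 20 units apart, as described in the question, this gives the same result as the first answer; the two only differ on long runs such as 1, 2, 3, 4, 5.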